This paper describes a circuit transformation called retiming in which registers are added
at some points in a circuit and removed from others in such a way that the functional behavior
of the circuit as a whole is preserved. We show that retiming can be used to transform a given
synchronous circuit into a more efficient circuit under a variety of different cost criteria. We
model a circuit as a graph in which the vertex set $V$ is a collection of combinational logic
elements and the edge set $E$ is the set of interconnections, each of which may pass through zero
or more registers. We give an $O(|V|\,|E| \lg |V|)$ algorithm for determining an equivalent retimed
circuit with the smallest possible clock period. We show that the problem of determining an
equivalent retimed circuit with minimum state (total number of registers) is polynomial-time
solvable. This result yields a polynomial-time optimal solution to the problem of pipelining
combinational circuitry with minimum register cost. We also give a characterization of optimal
retiming based on an efficiently solvable mixed-integer linear programming problem.

1 Introduction


The  goal  of  VLSI  design  automation  is  to  speed  the  design  of  a  system  without  sacrificing  the 
quality of implementation. A common means of achieving this goal is through the use of
optimization tools that improve the quality of a quickly designed circuit. In this paper we show how
to  optimize  clocked  circuits  by  relocating  registers  so  as  to  reduce  combinational  rippling.  Unlike 
pipelining,  this  technique,  which  we  call  retiming,  does  not  increase  circuit  latency. 

In order to illustrate retiming, consider the problem of designing a digital correlator. The
correlator takes a stream of bits $x_0, x_1, x_2, \ldots$ as input and compares it with a fixed-length pattern
$a_0, a_1, \ldots, a_k$. After receiving each input $x_i$ ($i \ge k$), the correlator produces as output the number
of matches

$$y_i = \sum_{j=0}^{k} \delta(x_{i-j}, a_j), \qquad (1)$$

where $\delta$ is the comparison function

$$\delta(x, y) = \begin{cases} 1 & \text{if } x = y, \\ 0 & \text{otherwise.} \end{cases}$$
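
Equation (1) can be read directly as an executable specification. The following is a minimal Python sketch, not part of the original design; the names `delta` and `correlate` and the list-based representation of the bit stream are assumptions made for illustration.

```python
def delta(x, y):
    """Comparison function from Equation (1): 1 on a match, 0 otherwise."""
    return 1 if x == y else 0


def correlate(xs, pattern):
    """Return the outputs y_k, y_{k+1}, ... for input stream xs.

    pattern holds a_0 .. a_k; output y_i counts the positions j at which
    x_{i-j} equals a_j, exactly as in Equation (1).
    """
    k = len(pattern) - 1
    return [
        sum(delta(xs[i - j], pattern[j]) for j in range(k + 1))
        for i in range(k, len(xs))
    ]


# Hypothetical input stream and pattern for the case k = 3.
outputs = correlate([1, 0, 1, 1, 0], [1, 1, 0, 1])
```

Any hardware realization of the correlator, retimed or not, should agree with this reference behavior on every input stream.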


Figure  1  shows  a  design  of  a  simple  correlator  for  the  case  when  k  =  3.  Correlator  1  consists 
of  two  kinds  of  functional  elements,  adders  and  comparators,  whose  I/O  characteristics  are  shown 
in the figure. The boxes between the comparators are registers which act to shift the $x_i$ to the
right down the length of the correlator. On each tick of the global clock, each $x_i$ is compared
with a character of the pattern, and the adders sum up the number of matches.


Figure 1. Correlator 1: A simple circuit made of two kinds of functional elements. Each comparator
$\delta$ has a propagation delay of 3 esec, and each adder + has a propagation delay of 7 esec. A longest
path of combinational rippling starts at the register on the connection labeled A, and thus the clock
period of the circuit is 24 esec.

This design, though easy to understand, has poor performance. Between ticks of the clock,
the partial sums of the matches ripple up the length of the correlator. Suppose, for instance, that
each adder has a propagation delay of 7 esec,¹ and each comparator has a propagation delay of
3 esec. Then the clock period must be at least 24 esec, the time for a signal to propagate from
the register on the connection labeled A through one comparator and three adders (3 + 7 + 7 + 7 = 24).

A  design  that  gives  better  performance  can  be  derived  by  removing  the  register  on  connection 
A  from  Correlator  1  and  inserting  a  new  register  on  connection  B,  as  shown  in  Figure  2.  To 
show  that  these  two  correlators  are  indeed  functionally  equivalent,  consider  the  portion  of  the 
circuit  surrounded  by  the  dashed  box  in  the  figure.  It  communicates  with  the  rest  of  the  circuit 
only  through  connections  A  and  B.  When  the  register  on  A  is  removed,  all  input  signals  to  this 
portion  of  the  circuit  arrive  one  clock  tick  earlier,  and  thus  the  boxed  portion  of  Correlator  2 
performs  the  same  sequence  of  computations  as  in  Correlator  1,  but  one  clock  tick  earlier.  Since 
the  output  from  the  boxed  portion  of  Correlator  2  is  delayed  one  clock  tick  by  the  new  register 
on connection B, the remainder of the circuit sees the same behavior as in Correlator 1. We say
that  the  three  functional  elements  in  the  boxed  portion  of  Correlator  2  lead  by  one  clock  tick  the 
corresponding  functional  elements  in  Correlator  1.  Alternatively,  we  say  that  the  three  elements 
in  Correlator  1  lag  by  one  clock  tick  the  corresponding  elements  in  Correlator  2. 


Figure 2. Correlator 2: A retimed circuit functionally equivalent to, but more efficient than,
Correlator 1. The longest path of combinational rippling begins at the register on connection C, and the
clock period of this circuit is 17 esec.

Correlator  1  and  Correlator  2  are  functionally  equivalent,  but  the  performance  of  the  retimed 
circuit  Correlator  2  is  better  than  that  of  Correlator  1.  The  clock  period  of  Correlator  2  is  17 
esec — the  time  for  a  signal  to  propagate  from  the  register  on  connection  C  through  one  comparator 
and  two  adders.  Notice  that  the  two  designs  use  the  same  functional  elements  connected  in  the 
same  manner  and  differ  only  in  the  locations  of  registers.  Correlator  2  has  the  I/O  characteristic 
specified  by  Equation  (1),  but  it  should  be  apparent  that  a  direct  verification  requires  considerably 
more  effort  than  the  verification  of  Correlator  1. 

Retiming,  the  technique  of  inserting  and  deleting  registers  in  such  a  way  as  to  preserve 
function,  can  be  used  to  produce  an  even  faster  circuit  than  Correlator  2.  Section  4  gives  an 
implementation  of  the  correlator  that  achieves  a  clock  period  of  13  esec.  Remarkably,  if  the 
pattern  of  comparators  and  adders  is  extended  arbitrarily  to  the  right,  a  clock  period  of  14  esec 
can  always  be  achieved  by  retiming.  In  this  paper  we  exhibit  a  polynomial-time  algorithm  for 
determining  a  retiming  of  a  circuit  that  minimizes  clock  period. 

¹ Recall that one eptosecond (esec) equals one one-zillionth of a second.

The remainder of this paper is organized as follows. Section 2 presents the graph-theoretic
model of synchronous circuits used in this paper. In Section 3 we formally describe the operation
of retiming [14], in which registers are deleted from some connections of a circuit and added
to others so that the circuit function is preserved. Section 4 gives a simple polynomial-time
algorithm for minimizing the clock period of a circuit. Section 5 gives an asymptotically more
efficient algorithm to solve the same problem. In Section 6 we show that the problem of finding
a retiming of a circuit that minimizes clock period can be reduced to an efficiently solvable
mixed-integer linear programming problem, thus providing a framework for retiming based on
mathematical programming.

Sections  7,  8,  and  9  discuss  extensions  of  these  results.  Section  7  considers  the  special  case 
where  all  functional  elements  have  identical  propagation  delays  and  shows  that  optimal  retimings 
can  be  found  more  efficiently  in  this  case.  The  section  also  discusses  the  relationship  of  this  work 
to  systolic  computation  and  shows  how  to  improve  the  performance  of  many  systolic  circuits  in 
the  literature.  While  earlier  sections  are  concerned  with  finding  retimings  that  are  optimal  in  the 
sense  of  minimizing  clock  period,  Section  8  examines  a  different  optimization  criterion,  namely 
minimizing  the  total  amount  of  state  (number  of  registers)  in  the  retimed  circuit.  In  particular, 
we  show  that  the  problem  of  retiming  a  circuit  to  minimize  its  state  subject  to  an  upper  bound  on 
the  clock  period  can  be  reduced  to  the  linear-programming  dual  of  a  minimum-cost  flow  problem, 
and  hence  can  be  solved  optimally  in  polynomial  time.  Section  9  extends  our  methods  to  a  more 
general  circuit  model  in  which  individual  functional  elements  may  have  nonuniform  propagation 
delays — e.g.,  the  low-order  output  bit  of  an  adder  may  be  available  earlier  than  the  high-order 
bit. 

In Section 10, we briefly mention further extensions, including the application of our
algorithms to optimal pipelining of combinational circuitry.

2  Preliminaries 

In this section we define the notations and terminology needed in the paper and present our
graph-theoretic model of digital circuits. We conclude by giving a simple algorithm for determining the
minimum feasible clock period of a circuit from its graph.

We can view a circuit abstractly as a network of functional elements and globally clocked
registers. The registers are assumed to have the following characteristics: each has a single input
and a single output; all are clocked by the same periodic waveform; and at each clock tick, each
storage element samples its input and the sampled value is made available at the output until
the next tick. We also assume that changes in the output of one storage element do not interfere
with the input to another at the same clock tick. An example of such a storage element is an
edge-triggered, master-slave, D-type flip-flop [21].

The functional elements provide the computational power of the circuit. Our model is
unconcerned with the level of complexity of the functional elements; they might be NAND gates,
multiplexors, or ALUs, for example. Each functional element has an associated propagation
delay.  The  outputs  of  a  functional  element  at  any  time  are  defined  as  a  specified  function  of  its 
inputs,  provided  that  all  the  inputs  have  been  stable  for  a  time  at  least  equal  to  the  element’s 
propagation  delay.  We  make  the  conservative  assumption  that  when  an  input  to  a  functional 
element  changes,  the  outputs  may  behave  arbitrarily  until  they  settle  to  their  final  values. 

To be precise, we model a circuit as a finite, vertex-weighted, edge-weighted, directed
multigraph $G = (V, E, d, w)$ (henceforth, we shall simply say “graph” or, more frequently, “circuit”).
Figure 3 shows the graph of Correlator 1 from Figure 1. The vertices $V$ of the graph model the
functional elements of the circuit. Each vertex $v \in V$ is weighted with its numerical propagation
delay $d(v)$. The directed edges $E$ of the graph model interconnections between functional
elements. Each edge $e \in E$ connects an output of some functional element to an input of some
functional element and is weighted with a register count $w(e)$. The register count is the number of
registers along the connection.² Between two vertices, there may be multiple edges with different
register counts.


Figure  3.  The  graph  model  of  Correlator  1  from  Figure  1. 

Vertices  can  be  designated  to  represent  interfaces  with  the  external  world,  and  each  such 
vertex is given zero propagation delay, as is shown for vertex $v_h$ in Figure 3. (We elaborate on
this  technicality  in  Section  10.)  If  the  relative  times  of  events  at  multiple  external  interfaces 
must  be  preserved,  we  treat  them  as  a  single  interface  and  represent  them  as  a  single  vertex  with 
multiple  incident  edges.  Otherwise,  we  assume  that  multiple  external  interfaces  are  independent 
of  each  other  and  cannot  communicate  with  each  other  externally.  For  most  of  our  theory, 
external  interfaces  can  be  handled  as  ordinary  vertices,  and  thus  our  formalism  omits  them. 

We shall use the following terminology extensively. To avoid confusion between vertex-weight
functions such as the propagation delay $d$ and edge-weight functions such as the register count
$w$, we shall use the term weight for edge-weight functions only. In fact, the only vertex-weight
functions we use are the propagation delays $d(v)$, and in general we shall refer to the particular
edge weights $w(e)$ of a circuit as register counts. If $e$ is an edge in a graph that goes from vertex
$u$ to vertex $v$, we shall use the notation $u \xrightarrow{e} v$. In the event that the identity of either the head
or the tail of an edge is unimportant, we shall use the symbol ?, as in $u \xrightarrow{e} {?}$.

For a graph $G$, we shall view a path $p$ in $G$ as a sequence of vertices and edges. If a path $p$
starts at a vertex $u$ and ends at a vertex $v$, we use the notation $u \stackrel{p}{\leadsto} v$. A simple path contains
no vertex twice, and therefore the number of vertices exceeds the number of edges by exactly
one.

We extend the register count function $w$ in a natural way from single edges to arbitrary paths.
For any path $p = v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} \cdots \xrightarrow{e_{k-1}} v_k$, we define the path weight as the sum of the weights of
the edges of the path:

$$w(p) = \sum_{i=0}^{k-1} w(e_i).$$

Similarly, we extend the propagation delay function $d$ to simple paths. For any simple path
$p = v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} \cdots \xrightarrow{e_{k-1}} v_k$, we define the path delay as the sum of the delays of the vertices of the
path:

$$d(p) = \sum_{i=0}^{k} d(v_i).$$

² If an output of a functional element fans out to more than one other functional element, the single
interconnection can be treated, without loss of generality, as several edges, each with an appropriate weight. Any optimization
can be translated from the model back to a circuit with fanout. Section 8 examines fanout more closely.

In order that a graph $G = (V, E, d, w)$ have well-defined physical meaning as a circuit, we
place nonnegativity restrictions on the propagation delays $d(v)$ and the register counts $w(e)$:

D1 The propagation delay $d(v)$ is nonnegative for each vertex $v \in V$.

W1 The register count $w(e)$ is a nonnegative integer for each edge $e \in E$.

We  also  impose  the  restriction  that  there  be  no  directed  cycles  of  zero  weight: 

W2  In  any  directed  cycle  of  G,  there  is  some  edge  with  (strictly)  positive  register  count. 

We define a synchronous circuit as a circuit that satisfies Conditions D1, W1, and W2. The
reason  for  including  Condition  W2  is  that  whenever  an  edge  e  between  two  vertices  u  and  v  has 
zero  weight,  a  signal  entering  vertex  u  can  ripple  unhindered  through  vertex  u  and  subsequently 
through  vertex  v.  If  the  rippling  can  feed  back  upon  itself,  problems  of  asynchronous  latching, 
oscillation,  and  race  conditions  can  arise.  By  prohibiting  zero-weight  cycles,  Condition  W2 
prevents  these  problems  from  occurring,  provided  that  the  system  clock  runs  slowly  enough  to 
allow  the  outputs  of  all  the  functional  elements  to  settle  between  each  two  consecutive  ticks. 

For any synchronous circuit $G$, we define the (minimum feasible) clock period $\Phi(G)$ as the
maximum amount of propagation delay through which any signal must ripple between clock ticks.
Condition W2 guarantees that the clock period is well defined by the equation

$$\Phi(G) = \max\{d(p) : w(p) = 0\}.$$

For the circuit graph in Figure 3 the clock period is 24, which corresponds to the sum of the
propagation delays along the path $v_4 \to v_5 \to v_6 \to v_7$.

Determination of the clock period $\Phi(G)$ is relatively simple. The algorithm we present here is
similar  to  an  algorithm  that  forms  a  part  of  a  design  tool  developed  at  American  Microsystems, 
Inc.  [17]. 

Algorithm CP (Compute the clock period of a circuit). This algorithm computes the clock
period $\Phi(G)$ for a synchronous circuit $G = (V, E, d, w)$.

1. Let $G_0$ be the subgraph of $G$ that contains precisely those edges $e$ with register count
$w(e) = 0$.

2. By Condition W2, $G_0$ is acyclic. Perform a topological sort on $G_0$, totally ordering its
vertices so that if there is an edge from vertex $u$ to vertex $v$ in $G_0$, then $u$ precedes $v$ in
the total order.

3. Go through the vertices in the order defined by the topological sort. On visiting each
vertex $v$, compute the quantity $\Delta(v)$ as follows:

a. If there is no incoming edge to $v$ in $G_0$, set $\Delta(v) \leftarrow d(v)$.

b. Otherwise, set $\Delta(v) \leftarrow d(v) + \max\{\Delta(u) : u \xrightarrow{e} v \text{ and } w(e) = 0\}$.

4. The clock period $\Phi(G)$ is $\max_{v \in V} \Delta(v)$. ∎

The algorithm works because for each vertex $v$, the quantity $\Delta(v)$ equals the maximum sum $d(p)$
of vertex delays along any zero-weight directed path $p$ in $G$ such that ${?} \stackrel{p}{\leadsto} v$. The running time
is $O(|E|)$.
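
The steps of Algorithm CP can be sketched in Python as follows. This is an illustrative sketch rather than the tool cited above: the function name `clock_period`, the dictionary-and-triple representation of the circuit, and in particular the edge list for Correlator 1 are assumptions (Figure 3 is not reproduced in the text, so the graph below is a reconstruction consistent with the delays and the 24-esec path described earlier).

```python
from collections import defaultdict, deque

# Assumed reconstruction of the Correlator 1 graph: vh is the external
# interface, v1..v4 are comparators (delay 3), v5..v7 are adders (delay 7).
DELAY = {"vh": 0, "v1": 3, "v2": 3, "v3": 3, "v4": 3, "v5": 7, "v6": 7, "v7": 7}
EDGES = [  # (tail, head, register count)
    ("vh", "v1", 1), ("v1", "v2", 1), ("v2", "v3", 1), ("v3", "v4", 1),
    ("v4", "v5", 0), ("v3", "v5", 0), ("v5", "v6", 0), ("v2", "v6", 0),
    ("v6", "v7", 0), ("v1", "v7", 0), ("v7", "vh", 0),
]


def clock_period(delay, edges):
    """Compute the clock period by a longest-path pass over G_0."""
    # Step 1: keep only the zero-register edges (the subgraph G_0).
    succ = defaultdict(list)
    indegree = {v: 0 for v in delay}
    for u, v, w in edges:
        if w == 0:
            succ[u].append(v)
            indegree[v] += 1

    # Steps 2-3: visit vertices in topological order (Kahn's method),
    # maintaining Delta(v) = d(v) + max Delta(u) over zero-weight edges u -> v.
    big_delta = dict(delay)  # Step 3a: Delta(v) = d(v) with no incoming edge
    ready = deque(v for v in delay if indegree[v] == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in succ[u]:
            big_delta[v] = max(big_delta[v], delay[v] + big_delta[u])
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if visited != len(delay):
        raise ValueError("zero-weight cycle: Condition W2 is violated")

    # Step 4: the clock period is the largest Delta value.
    return max(big_delta.values())


period = clock_period(DELAY, EDGES)
```

On the assumed graph, the largest $\Delta$ value arises along the zero-weight path through $v_4$, $v_5$, $v_6$, and $v_7$.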

3 Retiming

Retiming  transformations  alter  the  clock  period  of  a  circuit  by  inserting  and  deleting  registers, 
but  without  otherwise  affecting  the  circuit’s  structure.  This  section  formally  defines  retiming  and 
proves  some  simple  properties  of  the  transformation. 

A retiming can be viewed as an assignment of a lag to each vertex in a circuit, and this
is how we shall define it formally. A retiming of a circuit $G = (V, E, d, w)$ is an integer-valued
vertex-labeling $r : V \to \mathbb{Z}$. The retiming specifies a transformation of the original circuit in which
registers are added and removed so as to change the graph $G$ into a new graph $G_r = (V, E, d, w_r)$,
where the edge-weighting $w_r$ is defined for an edge $u \xrightarrow{e} v$ by the equation

$$w_r(e) = w(e) + r(v) - r(u). \qquad (2)$$

In the example of Figure 3, the retiming that assigns $-1$ to functional elements $v_3$, $v_4$, and $v_5$,
and assigns 0 to all other vertices, yields the circuit of Figure 2.
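
Equation (2) translates into a one-line transformation of the edge list. The sketch below is illustrative only; the function name `retime` and the edge list (an assumed reconstruction of the Figure 3 graph, which is not reproduced in the text) are not from the original.

```python
# Assumed reconstruction of the Correlator 1 graph as (tail, head, registers).
EDGES = [
    ("vh", "v1", 1), ("v1", "v2", 1), ("v2", "v3", 1), ("v3", "v4", 1),
    ("v4", "v5", 0), ("v3", "v5", 0), ("v5", "v6", 0), ("v2", "v6", 0),
    ("v6", "v7", 0), ("v1", "v7", 0), ("v7", "vh", 0),
]


def retime(edges, r):
    """Apply Equation (2): w_r(e) = w(e) + r(v) - r(u) for each edge u -> v.

    r maps vertex names to integer lags; vertices missing from r get lag 0.
    Returns the retimed edge list and a flag telling whether every register
    count is nonnegative (Condition W1).
    """
    retimed = [(u, v, w + r.get(v, 0) - r.get(u, 0)) for u, v, w in edges]
    nonnegative = all(w >= 0 for _, _, w in retimed)
    return retimed, nonnegative


# The retiming from the text: lag -1 on v3, v4, v5 and lag 0 elsewhere.
RETIMED, NONNEGATIVE = retime(EDGES, {"v3": -1, "v4": -1, "v5": -1})
```

Under the assumed graph, the register on the edge into $v_3$ disappears and a register appears on the edge out of $v_5$, which matches the move from connection A to connection B described in Section 1.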

Equation  (2),  which  tells  how  retiming  affects  the  register  counts  of  edges,  extends  naturally 
to  paths. 

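Lemma 1. Let $G = (V, E, d, w)$ be a synchronous circuit, and let $r$ be a retiming.
Then for any path $u \stackrel{p}{\leadsto} v$ in $G$, we have

$$w_r(p) = w(p) + r(v) - r(u).$$

Proof. Suppose $p$ is composed of vertices and edges $v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} \cdots \xrightarrow{e_{k-1}} v_k$. We have

$$\begin{aligned}
w_r(p) &= \sum_{i=0}^{k-1} w_r(e_i) \\
&= \sum_{i=0}^{k-1} \bigl(w(e_i) + r(v_{i+1}) - r(v_i)\bigr) \\
&= \sum_{i=0}^{k-1} w(e_i) + \sum_{i=0}^{k-1} \bigl(r(v_{i+1}) - r(v_i)\bigr) \\
&= w(p) + r(v_k) - r(v_0)
\end{aligned}$$

because the sum on the right telescopes. ∎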

Corollary  2.  Let  G  =  (V,E,d,w)  be  a  synchronous  circuit,  and  let  r  be  a  retiming 
on  the  vertices  of  G.  Then  for  any  cycle  p  in  G,  we  have  wr(p)  =  w(p). 

Proof. Immediate from Lemma 1. ∎

A retiming $r$ of a circuit $G$ is legal if the retimed graph $G_r$ satisfies Conditions W1 and W2.
An arbitrary assignment of lags to the vertices of a circuit $G$ may cause the retimed circuit $G_r$ to
violate Condition W1, which says that no edge may have a negative register count. This condition
must be checked explicitly in order to ensure that a retiming is legal. Interestingly enough,
Condition W2 need not also be checked because of the following consequence of Corollary 2.

Corollary 3. Let $G = (V, E, d, w)$ be a synchronous circuit, and let $r$ be a retiming
on the vertices of $G$ such that $G_r$ satisfies Condition W1. Then $r$ is a legal retiming.

Proof. Since the propagation delays $d$ are unaffected by retiming, Condition D1 is satisfied by
graph $G_r$. Since Condition W1 is true by supposition, it remains only to show that $G_r$ satisfies
Condition W2. Let $p$ be any cycle in $G$. We must show that $p$ includes at least one edge $e$ such
that $w_r(e) > 0$. Since graph $G$ satisfies Conditions W1 and W2, the register count $w(p)$ of the
cycle in $G$ must be positive. But by Corollary 2, the register count $w_r(p)$ of the cycle in $G_r$ is
equal to $w(p)$ and is therefore positive. Hence, there must be an edge on the cycle in $G_r$ that
has positive register count. ∎

To conclude this section, we comment that it is necessary to prove that when a circuit $G$
is retimed to produce a new graph $G_r$, the new circuit is functionally equivalent, as seen by
the external world, to the original, provided, of course, that $G_r$ satisfies Conditions W1 and
W2. Such a proof can be found in [14], which also contains a technical definition of the term
“equivalent.”

Moreover,  we  can  show  that  retiming  is,  in  a  sense,  the  most  general  possible  method  for 
changing  the  register  counts  within  a  circuit  without  disturbing  the  circuit’s  function.  Although 
we  do  not  formally  prove  it  here,  we  outline  the  thread  of  reasoning.  (For  an  example  of  a 
similar argument used to prove a weaker result, see the proof of Theorem 3 in [14].) Without
loss of generality, assume that any circuit $G = (V, E, d, w)$ under discussion has the following two
properties. 1. Every vertex $v \in V$ is connected by a path to some external interface. 2. Every
vertex $v \in V$ has at least one input. (Otherwise, $v$ computes a constant function.³) Given the
graph  of  such  a  circuit,  but  no  knowledge  of  what  functions  are  computed  by  the  functional 
elements,  it  is  impossible,  other  than  by  retiming,  to  alter  the  register  counts  on  the  edges  and 
be  assured  that  the  external  behavior  is  unchanged.  For  any  relabeling  of  the  edge  weights  that 
is  not  a  retiming,  an  adversary  can  specify  the  functional  elements  in  such  a  way  that  the  new 
circuit  behaves  differently  from  the  original  circuit.  We  omit  the  details  of  this  argument. 

4  An  algorithm  for  clock  period  minimization 

This section presents a polynomial-time algorithm for retiming a circuit so as to maximize
performance. Specifically, we solve the following clock period minimization problem: Given a circuit
graph $G = (V, E, d, w)$, find a legal retiming $r$ of $G$ such that the clock period $\Phi(G_r)$ of the retimed
circuit $G_r$ is as small as possible. The solution of this problem depends on some basic results
from  combinatorial  optimization  and  graph  theory.  In  particular,  we  rely  on  the  fact  that  the 
following  linear  programming  problem  can  be  solved  efficiently. 

Problem LP. Let $S$ be a set of $m$ linear inequalities of the form

$$x_j - x_i \le a_{ij} \qquad (3)$$

on the unknowns $x_1, x_2, \ldots, x_n$, where the $a_{ij}$ are given real constants. Determine
feasible values for the unknowns $x_i$, or determine that no such values exist.

Constraint systems in which each constraint has the form of Inequality (3) arise in shortest
paths problems and have been studied extensively. Such a system of linear inequalities can be
satisfied, or determined to be inconsistent, in $O(mn)$ time by the Bellman-Ford algorithm [10,
p. 74].
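
To make the connection to shortest paths concrete, the following Python sketch solves Problem LP by Bellman-Ford relaxation. It is a minimal illustration under stated assumptions: the function name `solve_difference_constraints` and the triple encoding of constraints are hypothetical, and starting every unknown at 0 plays the role of a virtual source joined to every vertex by a zero-length edge.

```python
def solve_difference_constraints(n, constraints):
    """Find x_0 .. x_{n-1} satisfying x_j - x_i <= a for every (j, i, a).

    Each constraint acts as an edge from i to j of length a. Returns a list
    of feasible values, or None if the system is inconsistent (which happens
    exactly when the constraint graph has a negative-length cycle).
    """
    x = [0] * n  # distances from the implicit virtual source
    for _ in range(n):
        changed = False
        for j, i, a in constraints:
            if x[i] + a < x[j]:      # edge i -> j can shorten the path to j
                x[j] = x[i] + a
                changed = True
        if not changed:
            return x                 # every inequality now holds
    return None                      # still relaxing after n passes


# Hypothetical system: x_1 - x_0 <= 3 and x_0 - x_1 <= -1.
solution = solve_difference_constraints(2, [(1, 0, 3), (0, 1, -1)])
```

Each pass scans all $m$ constraints and at most $n$ passes are made, which gives the $O(mn)$ bound. When all the constants are integers, the values returned are integers as well, since they are sums of the constants.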

The algorithm for minimizing the clock period of a circuit is based on an alternative
characterization of clock period in terms of two quantities which we now define:

$$W(u, v) = \min\{w(p) : u \stackrel{p}{\leadsto} v\},$$
$$D(u, v) = \max\{d(p) : u \stackrel{p}{\leadsto} v \text{ and } w(p) = W(u, v)\}. \qquad (4)$$

The quantity $W(u, v)$ is the minimum number of registers on any path from vertex $u$ to vertex $v$.
We call a path $u \stackrel{p}{\leadsto} v$ such that $w(p) = W(u, v)$ a critical path from $u$ to $v$. The quantity $D(u, v)$
is the maximum total propagation delay on any critical path from $u$ to $v$. Both quantities are
undefined if there is no path from $u$ to $v$. Figure 4 shows the values for Correlator 1.

³ In the graph (Figure 1) of Correlator 1, for example, we do not use functional elements to input the constants
$a_i$, but have instead incorporated them into the comparators.

Figure 4. Tables showing the values of the functions $W$ and $D$ for Correlator 1. The quantity
$W(u, v)$ is the number of registers on a minimum-weight path from $u$ to $v$, and $D(u, v)$ is the maximum
propagation delay along any such critical path. The distinct entries in the table for $D$ include all
possible clock periods for any retiming of Correlator 1. A legal retiming $r$ produces a circuit $G_r$ with
clock period $\Phi(G_r) \le c$ if and only if $W_r(u, v) > 0$ wherever $D(u, v) > c$. Circled entries in the table
for $D$ are explained in the last paragraph of Section 4.

Lemma 4. Let $G = (V, E, d, w)$ be a synchronous circuit, and let $c$ be any positive
real number. The following are equivalent:

4.1 $\Phi(G) \le c$.

4.2 For all vertices $u$ and $v$ in $V$, if $D(u, v) > c$, then $W(u, v) \ge 1$.

Proof. (4.1 ⇒ 4.2): Suppose $\Phi(G) \le c$, and let $u$ and $v$ be vertices in $V$ such that $D(u, v) > c$.
If $W(u, v) = 0$, then there exists a path $p$ from $u$ to $v$ with propagation delay $d(p) = D(u, v)$,
which is greater than $c$, and register count $w(p) = W(u, v) = 0$. Contradiction.

(4.2 ⇒ 4.1): Suppose 4.2 holds, and let $u \stackrel{p}{\leadsto} v$ be any zero-weight path in $G$. Then we have
$W(u, v) = w(p) = 0$, so $p$ is a critical path and $d(p) \le D(u, v)$. Since $W(u, v) < 1$, Condition 4.2
gives $D(u, v) \le c$, and hence $d(p) \le c$. ∎

It is not difficult to compute $W$ by solving the all-pairs shortest-paths problem in $G$. Common
ways of solving this problem are the Floyd-Warshall method [10, p. 86], which runs in $O(|V|^3)$
time, and Johnson’s algorithm [7], which runs in $O(|V|\,|E| + |V|^2 \lg |V|)$ time using the Fibonacci
heap data structure due to Fredman and Tarjan [2]. The basic operations on weights used by
these  algorithms  are  addition  and  comparison.  The  following  algorithm  shows  that  with  a  suitably 
chosen  weight  function,  an  all-pairs  shortest-paths  algorithm  can  be  used  to  compute  both  W 
and  D. 

Algorithm WD (Compute $W$ and $D$). Given a synchronous circuit $G = (V, E, d, w)$, this
algorithm computes $W(u, v)$ and $D(u, v)$ for all $u, v \in V$ such that $u$ is connected to $v$ in $G$.

1. Weight each edge $u \xrightarrow{e} {?}$ in $E$ with the ordered pair $(w(e), -d(u))$.

2. Using the weighting from Step 1, compute the weight of the shortest path joining each
connected pair of vertices by solving an all-pairs shortest-paths problem. (In the all-pairs
algorithm, add two weights by performing componentwise addition. Compare weights
using lexicographic ordering.)

3. For each shortest-path weight $(x, y)$ between two vertices $u$ and $v$, set $W(u, v) \leftarrow x$ and
$D(u, v) \leftarrow d(v) - y$. ∎
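
Algorithm WD maps naturally onto Python tuples, which already compare lexicographically. The sketch below uses the Floyd-Warshall method for Step 2; the function name `compute_wd`, the data layout, and the Correlator 1 edge list (an assumed reconstruction of Figure 3, which is not reproduced in the text) are illustrative assumptions.

```python
INF = float("inf")

# Assumed reconstruction of the Correlator 1 graph.
DELAY = {"vh": 0, "v1": 3, "v2": 3, "v3": 3, "v4": 3, "v5": 7, "v6": 7, "v7": 7}
EDGES = [
    ("vh", "v1", 1), ("v1", "v2", 1), ("v2", "v3", 1), ("v3", "v4", 1),
    ("v4", "v5", 0), ("v3", "v5", 0), ("v5", "v6", 0), ("v2", "v6", 0),
    ("v6", "v7", 0), ("v1", "v7", 0), ("v7", "vh", 0),
]


def compute_wd(delay, edges):
    """Return dictionaries W and D keyed by (u, v) for connected pairs."""
    verts = list(delay)
    # The empty path from u to itself has weight (0, 0).
    dist = {(u, v): ((0, 0) if u == v else (INF, 0))
            for u in verts for v in verts}

    # Step 1: weight each edge u -> v with the pair (w(e), -d(u)).
    for u, v, w in edges:
        dist[u, v] = min(dist[u, v], (w, -delay[u]))

    # Step 2: Floyd-Warshall with componentwise addition and
    # lexicographic comparison (built into tuple comparison).
    for k in verts:
        for i in verts:
            if dist[i, k][0] == INF:
                continue
            for j in verts:
                if dist[k, j][0] == INF:
                    continue
                through_k = (dist[i, k][0] + dist[k, j][0],
                             dist[i, k][1] + dist[k, j][1])
                if through_k < dist[i, j]:
                    dist[i, j] = through_k

    # Step 3: unpack W(u, v) = x and D(u, v) = d(v) - y.
    W, D = {}, {}
    for (u, v), (x, y) in dist.items():
        if x != INF:
            W[u, v] = x
            D[u, v] = delay[v] - y
    return W, D


W, D = compute_wd(DELAY, EDGES)
```

Negating the delays in the second component is what turns the shortest-path minimization into a maximization of delay among paths of minimum register count.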

The  reason  that  W  and  D  are  important  is  that  they  behave  nicely  under  retiming. 

Lemma 5. Let $G = (V, E, d, w)$ be a synchronous circuit, and let $W$ and $D$ be defined
on $G$ by the equations (4). Let $r$ be a legal retiming of $G$, and let $W_r$ and $D_r$ be
defined analogously on $G_r$. Then

5.1 a path $p$ is a critical path of $G_r$ if and only if it is a critical path of $G$,

5.2 $W_r(u, v) = W(u, v) + r(v) - r(u)$ for all connected vertices $u, v \in V$, and

5.3 $D_r(u, v) = D(u, v)$ for all connected vertices $u, v \in V$.

Proof. Condition 5.1 follows from Lemma 1 because retiming changes the weights of all paths
from $u$ to $v$ by the same amount, and then 5.2 follows immediately. Condition 5.3 is a consequence
of 5.1 together with the fact that retiming does not alter propagation delays. ∎

The  next  result  is  a  corollary  to  Lemma  5  which  shows  that  the  range  of  D  contains  the  clock 
periods  of  all  circuits  obtainable  by  retiming  G.  In  Figure  4  the  20  distinct  values  in  the  table 
for  D  include  all  possible  clock  periods  for  any  retiming  of  Correlator  1. 

Corollary 6. Let $G = (V, E, d, w)$ be a synchronous circuit, and let $r$ be a retiming
of $G$. Then the clock period $\Phi(G_r)$ is equal to $D(u, v)$ for some $u, v \in V$.

Proof. By the definition of clock period, the circuit $G_r$ contains some zero-weight path $u \stackrel{p}{\leadsto} v$
such that $d(p) = \Phi(G_r)$, and thus we have $W_r(u, v) = w_r(p) = 0$. Moreover, no zero-weight path
in $G_r$ has greater propagation delay than $p$, which implies $D_r(u, v) = d(p)$. Hence, by Lemma 5
we have $\Phi(G_r) = D_r(u, v) = D(u, v)$. ∎

Lemma  4  and  Lemma  5  also  allow  us  to  characterize  the  conditions  under  which  a  retiming 
produces  a  circuit  whose  clock  period  is  no  greater  than  a  given  constant. 

Theorem 7. Let $G = (V, E, d, w)$ be a synchronous circuit, let $c$ be an arbitrary
positive real number, and let $r$ be a function from $V$ to the integers. Then $r$ is a legal
retiming of $G$ such that $\Phi(G_r) \le c$ if and only if

7.1 $r(u) - r(v) \le w(e)$ for every edge $u \xrightarrow{e} v$ of $G$, and

7.2 $r(u) - r(v) \le W(u, v) - 1$ for all vertices $u, v \in V$ such that $D(u, v) > c$.

Proof. By Corollary 3, the retiming $r$ is legal if and only if Condition 7.1 holds. If $r$ is indeed a
legal retiming of $G$, then by Lemma 4 the retimed circuit $G_r$ has clock period $\Phi(G_r) \le c$ precisely
under the condition that $W_r(u, v) \ge 1$ for all vertices $u, v \in V$ such that $D_r(u, v) > c$. Since by
Lemma 5 we have $W_r(u, v) = W(u, v) + r(v) - r(u)$ and $D_r(u, v) = D(u, v)$, this condition is
equivalent to Condition 7.2. ∎

Theorem 7 provides the basic tool needed to solve the clock period minimization problem.
Notice that the constraints on the unknowns $r(v)$ in the theorem are linear inequalities involving
only differences of unknowns, and thus we have an instance of Problem LP.⁴ Therefore, using
the Bellman-Ford algorithm to test whether a retimed circuit exists with clock period no greater than
some constant $c$ takes $O(|V|^3)$ time since there can be only $O(|V|^2)$ inequalities.

⁴ Actually, we have the integer linear programming version of the problem because the unknowns $r(v)$ are
required to be integer. Since the value on the right-hand side of each inequality is integer, however, the
Bellman-Ford algorithm produces an integer optimal solution if one exists.

We  now  present  an  algorithm  to  determine  a  retiming  for  a  circuit  G  such  that  the  clock 
period  of  the  retimed  circuit  is  minimized. 

Algorithm OPT1 (Clock period minimization). Given a synchronous circuit G = (V, E, d, w),
this algorithm determines a retiming r such that Φ(G_r) is as small as possible.

1. Compute W and D using Algorithm WD.

2. Sort the elements in the range of D.

3. Binary search among the elements D(u, v) for the minimum achievable clock period. To
test whether each potential clock period c is feasible, apply the Bellman-Ford algorithm
to determine whether the conditions in Theorem 7 can be satisfied.

4. For the minimum achievable clock period found in Step 3, use the values for the r(v)
found by the Bellman-Ford algorithm as the optimal retiming. ∎
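
The four steps of Algorithm OPT1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Algorithm WD is stood in for by a Floyd-Warshall pass over pairs (w(e), −d(u)) compared lexicographically, the function names `wd`, `solve_difference_constraints`, and `opt1` are hypothetical, and the example graph (eight vertices with delays 0, 3, and 7 and eleven edges) is an assumed stand-in for the correlator, chosen so that the result can be compared with the clock period of 13 quoted for Figure 5.

```python
from itertools import product

INF = float("inf")

def wd(vertices, edges, d):
    """Sketch of Algorithm WD: dicts W[u, v] and D[u, v] for every connected pair.

    Each edge u -> v carries the pair (w(e), -d(u)). Pairs add componentwise and
    compare lexicographically, so a shortest path has the fewest registers and,
    among such paths, the largest total delay.
    """
    best = {(u, v): (INF, 0) for u, v in product(vertices, repeat=2)}
    for v in vertices:
        best[v, v] = (0, 0)                       # empty path: W(v, v) = 0, D(v, v) = d(v)
    for u, v, w in edges:
        best[u, v] = min(best[u, v], (w, -d[u]))
    for k, i, j in product(vertices, repeat=3):   # k varies slowest (Floyd-Warshall order)
        (w1, y1), (w2, y2) = best[i, k], best[k, j]
        if w1 < INF and w2 < INF and (w1 + w2, y1 + y2) < best[i, j]:
            best[i, j] = (w1 + w2, y1 + y2)
    W = {p: w for p, (w, _) in best.items() if w < INF}
    D = {(u, v): d[v] - best[u, v][1] for (u, v) in W}
    return W, D

def solve_difference_constraints(vertices, constraints):
    """Bellman-Ford on triples (u, v, b) meaning r(u) - r(v) <= b; None if infeasible."""
    r = {v: 0 for v in vertices}                  # implicit source with zero-weight edges
    for _ in range(len(vertices)):
        changed = False
        for u, v, b in constraints:
            if r[v] + b < r[u]:
                r[u], changed = r[v] + b, True
        if not changed:
            return r
    return None                                   # still relaxing after |V| passes

def opt1(vertices, edges, d):
    W, D = wd(vertices, edges, d)                 # Step 1
    candidates = sorted(set(D.values()))          # Step 2
    lo, hi, answer = 0, len(candidates) - 1, None
    while lo <= hi:                               # Step 3: binary search
        mid = (lo + hi) // 2
        c = candidates[mid]
        cons = list(edges)                                            # Condition 7.1
        cons += [(u, v, W[u, v] - 1) for (u, v) in D if D[u, v] > c]  # Condition 7.2
        r = solve_difference_constraints(vertices, cons)
        if r is None:
            lo = mid + 1
        else:
            answer, hi = (c, r), mid - 1
    return answer                                 # Step 4: (period, retiming)

# Assumed example graph: host vh (delay 0), v1..v4 (delay 3), v5..v7 (delay 7).
V = ["vh", "v1", "v2", "v3", "v4", "v5", "v6", "v7"]
d = {"vh": 0, "v1": 3, "v2": 3, "v3": 3, "v4": 3, "v5": 7, "v6": 7, "v7": 7}
E = [("vh", "v1", 1), ("v1", "v2", 1), ("v2", "v3", 1), ("v3", "v4", 1),
     ("v4", "v5", 0), ("v5", "v6", 0), ("v6", "v7", 0), ("v7", "vh", 0),
     ("v3", "v5", 0), ("v2", "v6", 0), ("v1", "v7", 0)]
period, lag = opt1(V, E, d)
```

The sketch favors brevity over speed: the Floyd-Warshall pass costs O(|V|³), which matches the cost of each Bellman-Ford test on O(|V|²) constraints, so the overall bound is the same as the one stated for OPT1.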

Algorithm OPT1 runs in O(|V|³ lg |V|) time. For some circuits, we can improve
the performance of Algorithm OPT1 by using a smaller set of inequalities. (An algorithm with
provably better asymptotic performance is given in the next section.) The key observation is
that we may eliminate any inequality r(u) − r(v) ≤ W(u, v) − 1 from Condition 7.2 if either
D(u, v) − d(v) > c or D(u, v) − d(u) > c. The intuition behind this optimization is that there is
no need to explicitly require a path p to have positive weight w_r(p) if we already require some
subpath of p to have positive weight.


[Figure 5 graphic not reproduced. Recoverable vertex labels: r(v1) = −1, r(v2) = −1, r(v3) = −2, r(v4) = −2.]

Figure 5. The graph model of an optimal correlator with clock period 13. The circuit is obtained
from the graph of Correlator 1 in Figure 3 by applying the optimal retiming determined by
Algorithm OPT1. For each vertex v, the value r(v) is the lag of v with respect to the corresponding
vertex in Correlator 1. The retimed weight of an edge e: u → v is given by w_r(e) = w(e) + r(v) − r(u).

As  an  example,  Figure  5  shows  the  circuit  graph  of  Correlator  3,  which  can  be  obtained  from 
Correlator  1  by  applying  the  Bellman-Ford  algorithm  to  the  inequalities  from  Theorem  7  with 
clock  period  c  =  13.  There  are  11  inequalities  (one  for  each  edge)  that  must  be  satisfied  to  ensure 
a  legal  retiming — Condition  7.1  in  Theorem  7.  Of  the  potential  34  inequalities  arising  from  cases 
where  D(u,v)  >  13 — Condition  7.2  in  the  theorem — only  five  need  be  included  if  we  eliminate 
those  for  which  either  D(u,v)  -  d(v)  >  13  or  D(u,v)  -  d(u)  >  13.  In  the  table  for  D  in  Figure 
4, those entries corresponding to the five relevant inequalities are circled.

5 A more efficient algorithm for clock period minimization

In this section we describe an asymptotically more efficient algorithm for the clock period
minimization problem. Specifically, we will show that the feasible clock period test in Step 3 of
Algorithm OPT1, which determines whether there exists a retiming of G with clock period at
most c, can be performed in O(|V| |E|) time, a significant improvement over O(|V|³) for sparse
graphs. This result yields an O(|V| |E| lg |V|)-time algorithm for determining the optimal retiming.

We begin with the O(|V| |E|)-time algorithm for determining whether a given clock period is
feasible.

Algorithm FEAS (Feasible clock period test). Given a synchronous circuit G = (V, E, d, w) and
a desired clock period c, this algorithm produces a retiming r of G such that G_r is a synchronous
circuit with clock period Φ(G_r) ≤ c, if such a retiming exists.

1. For each vertex v ∈ V, set r(v) ← 0.

2. Repeat the following |V| − 1 times:

2.1. Compute graph G_r with the existing values for r.

2.2. Run Algorithm CP on the graph G_r to determine Δ(v) for each vertex v ∈ V.

2.3. For each v such that Δ(v) > c, set r(v) ← r(v) + 1.

3. Run Algorithm CP on the circuit G_r. If we have Φ(G_r) > c, then no feasible retiming
exists. Otherwise, r is the desired retiming. ∎
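
Algorithm FEAS translates almost line for line into code. The following Python sketch is illustrative only: the function names `cp`, `retime`, and `feas` are hypothetical, `cp` is a topological-order pass over the zero-weight edges standing in for Algorithm CP, and the example graph is an assumed stand-in for the correlator (delays 0, 3, and 7).

```python
def cp(vertices, edges, d):
    """Delta(v): largest total delay of any zero-weight path ending at v."""
    succ = {v: [] for v in vertices}
    indeg = {v: 0 for v in vertices}
    for u, v, w in edges:
        if w == 0:
            succ[u].append(v)
            indeg[v] += 1
    delta = {v: d[v] for v in vertices}
    ready = [v for v in vertices if indeg[v] == 0]
    for u in ready:                       # the list grows while it is scanned
        for v in succ[u]:
            delta[v] = max(delta[v], delta[u] + d[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return delta

def retime(edges, r):
    """Edges of G_r, with w_r(e) = w(e) + r(v) - r(u)."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]

def feas(vertices, edges, d, c):
    r = {v: 0 for v in vertices}                       # Step 1
    for _ in range(len(vertices) - 1):                 # Step 2
        delta = cp(vertices, retime(edges, r), d)      # Steps 2.1 and 2.2
        for v in vertices:                             # Step 2.3
            if delta[v] > c:
                r[v] += 1
    delta = cp(vertices, retime(edges, r), d)          # Step 3
    return r if max(delta.values()) <= c else None

# Assumed example graph: host vh (delay 0), v1..v4 (delay 3), v5..v7 (delay 7).
V = ["vh", "v1", "v2", "v3", "v4", "v5", "v6", "v7"]
d = {"vh": 0, "v1": 3, "v2": 3, "v3": 3, "v4": 3, "v5": 7, "v6": 7, "v7": 7}
E = [("vh", "v1", 1), ("v1", "v2", 1), ("v2", "v3", 1), ("v3", "v4", 1),
     ("v4", "v5", 0), ("v5", "v6", 0), ("v6", "v7", 0), ("v7", "vh", 0),
     ("v3", "v5", 0), ("v2", "v6", 0), ("v1", "v7", 0)]
r13 = feas(V, E, d, 13)
r12 = feas(V, E, d, 12)
```

Each call to `cp` touches every edge once, so the |V| passes cost O(|V| |E|) in total, in line with the bound claimed for the algorithm.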

Proof of correctness.⁵ Algorithm FEAS works by relaxation. Step 1 specifies an initial tentative
retiming in which each vertex has zero lag (so that G_r = G). Each iteration of Step 2 is equivalent
to one pass of a Bellman-Ford algorithm on the constraints in Theorem 7. We assume that the
tentative values produced during each pass over the constraint set depend only on the tentative
values from the previous pass.

After each iteration of Step 2, the tentative retiming is guaranteed to be legal. Consider an
edge e: u → v in G_r. If the retimed weight w_r(e) is strictly positive at the beginning of the iteration,
then it will be nonnegative at the end of the iteration because r(u) can increase by at most 1 and
r(v) cannot decrease. If w_r(e) = 0 at the beginning of the iteration and if r(u) is incremented,
then r(v) will be incremented as well because Δ(v) ≥ Δ(u) + d(v) ≥ Δ(u) > c in this case.

It remains to show that Step 2 simulates a pass of a Bellman-Ford algorithm on the constraints
from Theorem 7. Since the tentative retiming is always legal at the beginning of an iteration,
the constraints 7.1 are already satisfied. Thus, the relaxation step in the inner loop of the
Bellman-Ford algorithm does not change the value of any r(v) for these constraints.

To see that the effects of the relaxations due to the constraints 7.2 are achieved, consider any
two vertices u, v ∈ V. If we have D(u, v) ≤ c, then no inequality for critical paths from u to v
occurs in this constraint set. If the retimed critical path weight W_r(u, v) = W(u, v) + r(v) − r(u)
is positive, then the corresponding inequality in the constraint set is already satisfied. Finally,
when D(u, v) > c and W_r(u, v) = 0, there is some path p: u ⇝ v such that w_r(p) = W_r(u, v) = 0
and d(p) = D(u, v). The existence of this path implies that Δ(v) ≥ d(p) = D(u, v) > c, so that
r(v) will be given the new value r(v) + 1 = r(u) + W_r(u, v) − W(u, v) + 1 = r(u) − W(u, v) + 1,
precisely achieving the effect of the desired relaxation of the constraint r(u) − r(v) ≤ W(u, v) − 1.
Conversely, r(v) is incremented only when there exists some path p: x ⇝ v such that w_r(p) = 0
and d(p) = Δ(v) > c, implying that D(x, v) ≥ d(p) > c and W_r(x, v) = 0. ∎

Algorithm FEAS can be used to improve the clock period minimization algorithm OPT1.

⁵A more detailed proof can be found in [20].


Algorithm OPT2 (Clock period minimization). Given a synchronous circuit G = (V, E, d, w),
this algorithm determines a retiming r such that Φ(G_r) is as small as possible.

Run Algorithm OPT1 using Algorithm FEAS in Step 3, rather than the Bellman-Ford
algorithm, to test whether each potential clock period c is feasible. ∎

Algorithm OPT2 runs in O(|V| |E| lg |V|) time.

6  A  mathematical  programming  framework  for  retiming 

In this section we describe another algorithm for clock period minimization based on a special
case of mixed-integer linear programming. Specifically, we will show that the feasible clock period
test can be performed in O(|V| |E| lg |V|) time. Although this bound is not an improvement over
the O(|V| |E|) bound for Algorithm FEAS, the mathematical programming framework in this
section provides further insight into retiming.

The feasible clock period test can be reduced to the following mixed-integer programming
problem.

Problem MILP. Let S be a set of m linear inequalities of the form x_j − x_i ≤ a_ij
on the unknowns x_1, x_2, ..., x_n, where the a_ij are given real constants, and let k be
given. Determine feasible values for the unknowns x_i subject to the constraint that x_i
is integer for i = 1, 2, ..., k and real for i = k + 1, k + 2, ..., n, or determine that no
such values exist.

Although mixed-integer programming is in general NP-complete (because integer programming
is [4, p. 245]), this special case can be solved in O(mn + km lg n) time [15]. The reduction
of the feasible clock period test to Problem MILP makes use of the following lemma.

Lemma 8. Let G = (V, E, d, w) be a synchronous circuit, and let c be a positive real
number. Then the clock period Φ(G) is less than or equal to c if and only if there
exists a function s: V → [0, c] such that s(v) ≥ d(v) for every vertex v and such that
s(v) ≥ s(u) + d(v) for every zero-weight edge e: u → v.

Proof. For each vertex v, let Δ(v) be the maximal sum of the combinational delays along any
zero-weight path that ends at v. (This Δ is the same as the one in Algorithm CP.) By definition,
we have Φ(G) ≤ c if and only if Δ(v) ≤ c for all v. If we have Φ(G) ≤ c, the function Δ satisfies
the desired properties for s. Conversely, if a function s exists that has the desired properties,
then we have Δ(v) ≤ s(v) ≤ c for every vertex v. ∎

Lemma  8  and  Corollary  3  together  give  a  characterization  of  when  it  is  possible  to  retime  a 
circuit  so  that  the  retimed  circuit  has  a  clock  period  of  c  or  less. 

Lemma 9. Let G = (V, E, d, w) be a synchronous circuit, and let c be a positive real
number. Then there exists a retiming r of G such that Φ(G_r) ≤ c if and only if there
exists an assignment of a real value s(v) and an integer value r(v) to each vertex
v ∈ V such that the following conditions are satisfied:

9.1 −s(v) ≤ −d(v) for every vertex v ∈ V,

9.2 s(v) ≤ c for every vertex v ∈ V,

9.3 r(u) − r(v) ≤ w(e) wherever e: u → v, and

9.4 s(u) − s(v) ≤ −d(v) wherever e: u → v such that r(u) − r(v) = w(e).

Proof. Condition 9.3 captures the requirements for r to be a legal retiming, as given in Corollary 3,
namely that r be a mapping from V to Z such that G_r satisfies Condition W1 (no negative-weight
edges). Conditions 9.1, 9.2, and 9.4 capture the requirement for G_r to have a clock period
Φ(G_r) ≤ c as given in Lemma 8. (Recall that G_r is defined to have w_r(e) = w(e) + r(v) − r(u)
for each edge e: u → v.) ∎

Unfortunately, this result does not quite allow us to recast an instance of the feasible clock
period test as an instance of Problem MILP because of the qualifying clause “such that r(u) −
r(v) = w(e)” in Condition 9.4. The next theorem shows that the conditions can be expressed
without such a clause.

Theorem 10. Let G = (V, E, d, w) be a synchronous circuit, and let c be a positive
real number. Then there is a retiming r of G such that Φ(G_r) ≤ c if and only if there
exists an assignment of a real value R(v) and an integer value r(v) to each vertex
v ∈ V such that the following conditions are satisfied:

10.1 r(v) − R(v) ≤ −d(v)/c for every vertex v ∈ V,

10.2 R(v) − r(v) ≤ 1 for every vertex v ∈ V,

10.3 r(u) − r(v) ≤ w(e) wherever e: u → v, and

10.4 R(u) − R(v) ≤ w(e) − d(v)/c wherever e: u → v.

Proof. Any solution to the conditions in Lemma 9 can be converted to a solution to the conditions
above by using the same values for the r(v) and taking R(v) = r(v) + s(v)/c for each vertex v.
Conversely, any solution to the conditions above yields a solution to the conditions in Lemma 9
using the substitution s(v) = c (R(v) − r(v)). ∎
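
The substitution R(v) = r(v) + s(v)/c can be checked numerically. The Python sketch below is a hedged illustration, not part of the original development: it takes s(v) to be the longest zero-weight path delay Δ(v) in G_r (as in the proof of Lemma 8), uses exact rational arithmetic, and tests Conditions 10.1 through 10.4 directly. The function names `delta` and `theorem10_holds`, the example graph, and the retiming `lag` are all assumptions made for the example.

```python
from fractions import Fraction

def delta(vertices, edges, d):
    """Delta(v): largest total delay of any zero-weight path ending at v."""
    succ = {v: [] for v in vertices}
    indeg = {v: 0 for v in vertices}
    for u, v, w in edges:
        if w == 0:
            succ[u].append(v)
            indeg[v] += 1
    out = {v: d[v] for v in vertices}
    ready = [v for v in vertices if indeg[v] == 0]
    for u in ready:                       # the list grows while it is scanned
        for v in succ[u]:
            out[v] = max(out[v], out[u] + d[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return out

def theorem10_holds(vertices, edges, d, c, r, R):
    """True when (r, R) satisfies Conditions 10.1-10.4 for clock period c."""
    return (all(r[v] - R[v] <= Fraction(-d[v], c) for v in vertices)             # 10.1
            and all(R[v] - r[v] <= 1 for v in vertices)                          # 10.2
            and all(r[u] - r[v] <= w for u, v, w in edges)                       # 10.3
            and all(R[u] - R[v] <= w - Fraction(d[v], c) for u, v, w in edges))  # 10.4

# Assumed example graph and retiming (delays 0, 3, 7; lags chosen for illustration).
V = ["vh", "v1", "v2", "v3", "v4", "v5", "v6", "v7"]
d = {"vh": 0, "v1": 3, "v2": 3, "v3": 3, "v4": 3, "v5": 7, "v6": 7, "v7": 7}
E = [("vh", "v1", 1), ("v1", "v2", 1), ("v2", "v3", 1), ("v3", "v4", 1),
     ("v4", "v5", 0), ("v5", "v6", 0), ("v6", "v7", 0), ("v7", "vh", 0),
     ("v3", "v5", 0), ("v2", "v6", 0), ("v1", "v7", 0)]
lag = {"vh": 0, "v1": -1, "v2": -1, "v3": -2, "v4": -2, "v5": -2, "v6": -1, "v7": 0}

retimed = [(u, v, w + lag[v] - lag[u]) for u, v, w in E]
s = delta(V, retimed, d)                               # s(v) for the retimed circuit
R13 = {v: lag[v] + Fraction(s[v], 13) for v in V}      # R(v) = r(v) + s(v)/c with c = 13
R12 = {v: lag[v] + Fraction(s[v], 12) for v in V}      # same construction with c = 12
```

With c = 13 the constructed pair satisfies all four conditions, while the same lags with c = 12 violate Condition 10.2 because one vertex has s(v) = 13. The second check only shows that this particular pair fails, not that no other pair exists for c = 12.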

Theorem  10  is  the  basis  for  the  following  improvement  on  Algorithm  OPT1. 

Algorithm OPT3 (Clock period minimization). Given a synchronous circuit G = (V, E, d, w),
this algorithm determines a retiming r such that Φ(G_r) is as small as possible.

Run Algorithm OPT1, but in Step 3, test whether each potential clock period c is feasible
in the following manner.

3.1 Use Theorem 10 to produce an instance of Problem MILP that has a solution if and
only if clock period c is feasible.

3.2 Use the algorithm from [15] to determine whether the instance of Problem MILP has
a solution. ∎

This algorithm can be made to run in O(|V| |E| lg |V| + |V|² lg² |V|) time by choosing efficient
algorithms for each of the steps. If Johnson's all-pairs shortest-paths algorithm [7] using the
Fibonacci heap data structure due to Fredman and Tarjan [2] is used in Algorithm WD, Step 1
runs in O(|V| |E| + |V|² lg |V|) time. Since there are only O(|V|²) elements in the range of D,
Step 2 runs in O(|V|² lg |V|) time. Each iteration of the binary search in Step 3 requires solving an
instance of Problem MILP with |V| integer variables, |V| real variables, and 2|V| + 2|E| = O(|E|)
inequalities. Thus the total time for Step 3 is O(|V| |E| lg |V| + |V|² lg² |V|). The optimal retiming
from Step 4 is produced as a side effect of Step 3.

7  Unit  propagation  delay,  systolic  circuits,  and  slowdown 

This  section  examines  circuits  in  which  the  propagation  delays  of  all  functional  elements  are 
equal.  For  such  circuits,  the  clock  period  minimization  problem  can  be  solved  more  simply  than 
for  arbitrary  circuits.  In  this  section,  we  explore  the  relation  of  this  class  of  circuits  to  systolic 
computation [8], [9], [11], [14]. We observe that many systolic circuits in the literature can support
several independent, interleaved computations. In [14], we introduced a transformation called
slowdown  which,  when  coupled  with  retiming,  can  be  used  to  produce  a  systolic  circuit  from 
an  arbitrary  synchronous  circuit.  In  this  section,  we  give  an  efficient  algorithm  for  determining 
whether  any  given  circuit  G  can  be  produced  from  another  circuit  (called  a  reduced  form  of  G) 
by  slowdown  and  retiming.  If  such  a  reduced  circuit  exists,  then  our  algorithm  finds  one. 

We define a circuit G = (V, E, d, w) to be a unit-delay circuit if each vertex v ∈ V has
propagation delay d(v) = 1. The next theorem gives a characterization of when a unit-delay
circuit has clock period less than or equal to c. The theorem is phrased in terms of the graph
G − 1/c, which is defined as G − 1/c = (V, E, d, w') where w'(e) = w(e) − 1/c for every edge
e ∈ E. Thus G − 1/c is the graph obtained from G by subtracting 1/c from the weight of each
edge in G.

Theorem 11. Let G = (V, E, d, w) be a unit-delay synchronous circuit, and let c be
any positive integer. Then there is a retiming r of G such that Φ(G_r) ≤ c if and only
if G − 1/c contains no cycles having negative edge weight.

Proof. First, suppose G − 1/c has no negative-weight cycles. We shall produce a retiming r of
G such that Φ(G_r) ≤ c. Assume without loss of generality that there is a path from each vertex
v of G to some vertex v0 (if not, add edges of the form v → v0 with sufficiently large weight so
that no negative-weight cycles are introduced into G − 1/c), and let g(v) be the weight of the
shortest path from v to v0 in G − 1/c. For each vertex v, let r(v) = ⌈g(v)⌉.

We now prove that the function r so defined is a legal retiming and that Φ(G_r) ≤ c. First
we show legality by showing w_r(e) = w(e) + r(v) − r(u) ≥ 0 for every edge e: u → v. The shortest
path in G − 1/c from u to v0 is at least as short as the path consisting of e followed by p, where p
is the shortest path (in G − 1/c) from v to v0. Thus, we have g(u) ≤ g(v) + w(e) − 1/c. Taking
ceilings of both sides gives r(u) ≤ ⌈g(v) + w(e) − 1/c⌉ ≤ ⌈g(v)⌉ + w(e) = r(v) + w(e), and thus
w_r(e) = w(e) + r(v) − r(u) ≥ 0, as desired.

Next, we must show that the clock period of the retimed circuit G_r is at most c. That is, we
must show that w_r(p) ≥ 1 for any path p: u ⇝ v containing c or more edges. The shortest path
from u to v0 is at least as short as the path consisting of p followed by q, where q is the shortest
path from v to v0. Furthermore, the total weight along p in G − 1/c is at most w(p) − 1, since there
are at least c edges in the path. Thus, we have g(u) ≤ g(v) + w(p) − 1, and
w_r(p) = w(p) + ⌈g(v)⌉ − ⌈g(u)⌉ ≥ 1.

On the other hand, suppose G − 1/c contains some cycle p with negative weight. We must
prove that G cannot be retimed to have a clock period of c or less. Let n be the number of
edges in the cycle p. By the definition of G − 1/c, we have w(p) − n/c = w'(p), where w' is
the edge-weight function for G − 1/c. But by supposition, w'(p) is negative, which means that
w(p) − n/c < 0, that is, the cycle p contains fewer than n/c registers in G. But retiming leaves
the number of registers on any cycle unchanged (Corollary 2). Thus, no matter how the fewer
than n/c registers are distributed on the cycle of n vertices, there must be some register-free
path with at least c edges and, therefore, with at least c + 1 vertices. Consequently, G cannot be
retimed to have a clock period of c or less. ∎

To test whether there is a retiming r of a unit-delay circuit G such that Φ(G_r) ≤ c, we can
use the Bellman-Ford algorithm to find the weight g(v) of the shortest path in G − 1/c from
each vertex v to an arbitrary vertex v0. If the shortest-path weights are not well defined, the
Bellman-Ford algorithm detects a negative-weight cycle, which means that no retiming exists.
Thus, the feasible clock period test can be performed in O(|V| |E|) time for unit-delay circuits
using the Bellman-Ford algorithm directly.
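
The unit-delay test can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure exactly as described: instead of picking a vertex v0 and adding heavy edges toward it, the sketch gives every vertex a zero-weight edge to a virtual sink, which also introduces no negative-weight cycles and still yields g(u) ≤ g(v) + w(e) − 1/c on every edge. The name `retime_unit_delay` and the four-vertex ring are hypothetical, and exact fractions are used so that the ceilings are not affected by rounding.

```python
from fractions import Fraction
from math import ceil

def retime_unit_delay(vertices, edges, c):
    """Return r(v) = ceil(g(v)) for a unit-delay circuit, or None if G - 1/c
    has a negative-weight cycle (so that no retiming has clock period <= c)."""
    step = Fraction(1, c)
    g = {v: Fraction(0) for v in vertices}     # zero-weight edge from v to a virtual sink
    for _ in range(len(vertices)):
        changed = False
        for u, v, w in edges:                  # edge u -> v has weight w - 1/c in G - 1/c
            if g[v] + w - step < g[u]:
                g[u], changed = g[v] + w - step, True
        if not changed:
            return {v: ceil(g[v]) for v in vertices}
    return None                                # still relaxing: negative-weight cycle

# Hypothetical unit-delay ring with two registers on a single edge.
V = ["a", "b", "c", "d"]
E = [("a", "b", 2), ("b", "c", 0), ("c", "d", 0), ("d", "a", 0)]
r2 = retime_unit_delay(V, E, 2)    # ring weight 2 - 4/2 = 0, so period 2 is achievable
r1 = retime_unit_delay(V, E, 1)    # ring weight 2 - 4/1 < 0, so period 1 is not
```

In the retiming found for c = 2, the two registers end up on opposite sides of the ring, so every register-free path contains at most two vertices.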

A  systolic  circuit  is  a  unit-delay  circuit  in  which  there  is  at  least  one  register  along  every 
interconnection between two functional elements. Thus the clock period of a systolic circuit is the
minimum possible: the propagation delay through a single functional element. Systolic circuits
have  been  studied  extensively  ([8],  [9],  [11],  [14]),  and  they  have  many  applications  including 
signal  processing,  matrix  manipulation,  machine  vision,  and  raster  graphics. 

Interpreted  in  the  context  of  systolic  circuits,  Theorem  11  is  a  generalization  of  the  Systolic 
Conversion  Theorem  from  [14],  which  says  that  G  can  be  retimed  to  be  systolic  if  the  constraint 
graph  G  -  1  has  no  cycles  of  negative  weight.  (Simply  restrict  Theorem  11  to  the  case  where 
c  =  1.)  The  Systolic  Conversion  Theorem  is  generalized  in  a  different  way  in  [14],  however, 
through  the  idea  of  slowdown. 

For  any  circuit  G  =  (V,E,d,w)  and  any  positive  integer  c,  the  circuit  cG  is  the  circuit 
obtained  by  multiplying  all  the  register  counts  in  G  by  c.  That  is,  the  circuit  cG  is  defined  as 
cG = (V, E, d, w') where w'(e) = c w(e) for every edge e ∈ E. All the data flow in cG is slowed
down  by  a  factor  of  c,  so  that  cG  performs  the  same  computations  as  G,  but  takes  c  times  as 
many  clock  ticks  and  communicates  with  the  external  interfaces  only  on  every  cth  clock  tick.  In 
fact,  cG  acts  as  a  set  of  c  independent,  interleaved  instances  of  G. 

If  a  circuit  G  can  be  obtained  by  retiming  a  circuit  of  the  form  cG',  then  we  say  that  G  is  a 
c-slow  circuit,  and  more  specifically,  that  G  is  a  c-slow  form  of  G'.  In  this  situation,  we  say  G' 
is  a  reduced  form  of  G.  The  main  advantage  of  a  c-slow  circuit  is  that  it  can  often  be  retimed  to 
have  a  shorter  clock  period  than  any  of  its  reduced  forms.  For  some  applications,  throughput  is 
the  issue,  and  multiple,  interleaved  streams  of  computation  can  be  effectively  utilized.  A  c-slow 
circuit  that  is  systolic  offers  maximum  throughput. 

The  following  corollary  to  Theorem  11  tells  when  a  circuit  has  a  c-slow  form  which  is  systolic. 

Corollary 12. Let G = (V, E, d, w) be a unit-delay synchronous circuit, and let c be
an arbitrary positive integer. Then the following are equivalent:

12.1 The graph G − 1/c has no negative-weight cycles.

12.2 The circuit G can be retimed to have clock period less than or equal to c.

12.3 The circuit cG can be retimed to be systolic.

Proof. That 12.1 and 12.2 are equivalent is exactly Theorem 11. The equivalence of 12.1 and
12.3 follows by applying Theorem 11 to cG with clock period 1, and observing that cG − 1 has
a negative-weight cycle if and only if G − 1/c has a negative-weight cycle. ∎
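
The observation that closes the proof, namely that cG − 1 and G − 1/c have negative-weight cycles together, holds because every edge weight of cG − 1 is exactly c times the corresponding weight of G − 1/c. A small Python sketch can confirm this on an example. The helper names `has_negative_cycle`, `shift`, and `slow` and the four-vertex ring are hypothetical, introduced only for this check.

```python
from fractions import Fraction

def has_negative_cycle(vertices, edges):
    """Bellman-Ford from a virtual source that reaches every vertex at cost 0."""
    dist = {v: Fraction(0) for v in vertices}
    for _ in range(len(vertices)):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v], changed = dist[u] + w, True
        if not changed:
            return False
    return True                      # still relaxing after |V| passes

def shift(edges, x):
    """Edges of G - x: subtract x from every edge weight."""
    return [(u, v, w - x) for u, v, w in edges]

def slow(edges, c):
    """Edges of cG: multiply every register count by c."""
    return [(u, v, c * w) for u, v, w in edges]

# Hypothetical unit-delay ring with two registers on a single edge.
V = ["a", "b", "c", "d"]
E = [("a", "b", 2), ("b", "c", 0), ("c", "d", 0), ("d", "a", 0)]
agree = all(
    has_negative_cycle(V, shift(E, Fraction(1, c)))
    == has_negative_cycle(V, shift(slow(E, c), 1))
    for c in range(1, 5)
)
```

For this ring the two tests agree for c = 1 through 4: both report a negative cycle for c = 1 and none for c ≥ 2.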

The  registers  of  a  circuit  of  the  form  cG  are  naturally  divided  into  c  equivalence  classes.  Given 
any two registers A and B in cG, the number of registers on any two paths from register A to
register  B  are  congruent  modulo  c.  Moreover,  if  we  consider  undirected  paths,  in  which  edges 
can  be  traversed  in  the  reverse  direction,  and  if  we  generalize  the  notion  of  path  weight  by 
adding  1  for  each  register  on  a  forward  edge  and  subtracting  1  for  each  register  on  a  reverse 
edge,  the  register  counts  of  two  undirected  paths  from  register  A  to  register  B  are  also  congruent 
modulo  c.  Consequently,  the  registers  are  naturally  divided  into  equivalence  classes  according  to 
their  undirected  path  weight  (modulo  c)  from  an  arbitrary  vertex. 

At  any  given  time  step,  any  two  registers  in  different  equivalence  classes  contain  data  from 
independent  streams  of  computation — data  that  can  never  arrive  at  inputs  of  the  same  functional 
element  at  the  same  time.  Although  retiming  destroys  the  individual  identities  of  the  registers, 
Lemma  1  guarantees  that  the  registers  of  any  c-slow  circuit  can  still  be  partitioned  into  c  such 
equivalence  classes. 

Using  the  notion  of  equivalence  classes  of  registers,  the  following  scenario  illustrates  the 
relationships  given  in  Corollary  12.  Let  G  =  (V,E,d,w)  be  a  unit-delay  synchronous  circuit. 
Find an integer c such that G − 1/c has no negative-weight cycles, and consider the circuit cG.
There is a retiming r of cG such that (cG)_r is systolic. If we remove all the registers in the c-slow
circuit (cG)_r except for those in one equivalence class, the resulting circuit is a retimed form of
the original circuit G, and its clock period is less than or equal to c.

Many  systolic  circuits  appearing  in  the  literature  are  2-slow  or  3-slow — even  if  the  ideas 
of  slowdown  and  retiming  were  not  explicitly  used  in  their  design.  For  example,  the  systolic 
algorithms for band-matrix multiplication and LU-decomposition from [9] are 3-slow and can
support  three  independent,  interleaved  streams  of  computation.  If  all  independent  streams  of 
computation  cannot  be  utilized  in  a  c-slow  circuit,  it  may  be  desirable  to  remove  all  registers 
except  for  those  in  one  equivalence  class.  The  following  algorithm  determines  if  a  circuit  is 
actually  a  c-slow  form  of  another,  and  if  so,  produces  a  reduced  form  of  the  circuit. 

Algorithm R (Remove all but one equivalence class of registers in a circuit). Given a
synchronous circuit G = (V, E, d, w), this algorithm determines the largest c such that G is c-slow
and produces a reduced circuit G' such that G is a c-slow form of G'.

1. For each vertex v ∈ V, set dist(v) to the weight of some undirected path from v to an
arbitrary vertex v0 ∈ V.

2. Compute c = gcd {w(e) + dist(v) − dist(u) : e is an edge u → v}.

3. For each vertex v ∈ V, set r(v) = dist(v) mod c.

4. Produce G' = (V, E, d, w'), where w'(e) = (w(e) + r(v) − r(u))/c for each edge e: u → v.

Proof of correctness. We first show that for each edge e: u → v, the value w'(e) is a legal register
count. By construction, c evenly divides w(e) + r(v) − r(u) because c divides w(e) + dist(v) −
dist(u).⁶ Thus, for any edge, the register count w'(e) produced in Step 4 is guaranteed to be an
integer. In addition, w'(e) is guaranteed to be strictly greater than −1 because r(u) must be less
than c and w(e) + r(v) is at least 0. Since we have just shown that w'(e) is an integer, it must
be nonnegative.

The construction in Step 4 directly provides the identity G' = G_r/c, and thus G is a c-slow
form of G'. We now show that the c computed in Step 2 is the largest possible. Suppose there
is a c' such that G is a c'-slow form of another circuit G'. We wish to show that c' divides
w(e) + dist(v) − dist(u) for each edge e: u → v, and thus that c' divides c. For every vertex
v, the weights of all undirected paths in c'G' from v to v0 are congruent modulo c'. Since
retiming changes all path weights between two vertices by the same amount (which is provable
for undirected paths by generalizing Lemma 1), it must be the case that in G, the weights of
all undirected paths from v to v0 are congruent modulo c'. In particular, the weight of the path
from u to v0 that determines dist(u) and the weight w(e) + dist(v) of the path that follows e from
u to v and then continues from v to v0 must be congruent modulo c'. Hence c' divides
w(e) + dist(v) − dist(u). ∎
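
The four steps of Algorithm R can be sketched in Python as follows. The sketch is illustrative and makes several assumptions: the circuit is connected when edge directions are ignored, `vertices[0]` plays the role of v0, the name `algorithm_r` is hypothetical, and the case where every term in Step 2 is zero is handled by returning c = ∞ with all reduced weights equal to 0. The example circuit is constructed for the sketch: it is a retimed copy of 3G' for a small G' whose two cycles carry 1 and 2 registers.

```python
from math import gcd, inf

def algorithm_r(vertices, edges):
    """Return (c, reduced_edges), where edges are (u, v, w) triples for u -> v."""
    v0 = vertices[0]
    nbrs = {v: [] for v in vertices}
    for u, v, w in edges:
        nbrs[v].append((u, w))     # dist(u) = dist(v) + w: edge followed forward from u
        nbrs[u].append((v, -w))    # dist(v) = dist(u) - w: edge followed in reverse from v
    dist, stack = {v0: 0}, [v0]
    while stack:                   # Step 1: depth-first search from v0
        x = stack.pop()
        for y, step in nbrs[x]:
            if y not in dist:
                dist[y] = dist[x] + step
                stack.append(y)
    c = 0
    for u, v, w in edges:          # Step 2: gcd(0, x) == abs(x), so 0 is a neutral start
        c = gcd(c, w + dist[v] - dist[u])
    if c == 0:                     # every term was zero: c is infinite, all weights become 0
        return inf, [(u, v, 0) for u, v, _ in edges]
    r = {v: dist[v] % c for v in vertices}                               # Step 3
    reduced = [(u, v, (w + r[v] - r[u]) // c) for u, v, w in edges]      # Step 4
    return c, reduced

# Constructed example: a retimed form of 3G', where G' has edges
# a->b (1 register), b->c (0), c->a (0), and b->a (1).
V = ["a", "b", "c"]
E = [("a", "b", 1), ("b", "c", 1), ("c", "a", 1), ("b", "a", 5)]
c, reduced = algorithm_r(V, E)
```

Python's `%` operator returns a nonnegative result for a positive modulus, so r(v) lies in the range 0 to c − 1 even when dist(v) is negative, as the correctness argument requires.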

Step 1 of Algorithm R can be performed in time O(|V| + |E|) = O(|E|) by depth-first search.
Step 3 runs in O(|V|) time, and Step 4 takes O(|E|) time. Step 2 takes more work, but not
much more. The computation of the greatest common divisor of |E| integers can be performed
in O(|E| + lg x) time, where x is the least nonzero absolute value of any of the numbers. Just
start with this value x, which can be found in O(|E|) time, as a tentative gcd, and gcd in each
of the other numbers in any order. Each mod operation in Euclid's algorithm either uses up one
of the |E| numbers, or else divides the current tentative gcd by the golden ratio (1 + √5)/2.⁷
(As a practical matter, starting with any of the |E| numbers as the initial tentative gcd would
give reasonable performance, since the number of registers in a typical circuit is much less than
exponential in the number of edges.) Thus the total running time of Algorithm R is O(|E| + lg x).

⁶If w(e) + dist(v) − dist(u) = 0 for every edge e: u → v, then in Step 2 we get c = ∞. Using the standard
convention that x mod ∞ = x and x/∞ = 0, Algorithm R yields a reduced graph G' in which each edge e: u → v
has weight w'(e) = 0.

⁷That is, if n mod operations are performed to compute a new tentative gcd, then it will be smaller than the
old tentative gcd by at least a factor of ((1 + √5)/2)^(n−1).

Observe that Algorithm R works not only for unit-delay circuits, but for any synchronous
circuit.  Furthermore,  when  the  extra  equivalence  classes  of  registers  from  a  c-slow  circuit  are 
removed,  the  clock  period  of  the  reduced  circuit  is  not  unduly  lengthened.  The  definition  of  w' 
in Step 4 provides that for any path p of weight w(p) ≥ c in G, we have w'(p) > 0, which implies
that Φ(G') ≤ cΦ(G). To guarantee the minimum clock period, however, the reduced circuit must
generally  be  retimed. 

A  systolic  circuit  that  is  naturally  c-slow  can  be  converted  by  Algorithm  R  into  a  circuit 
that  performs  an  operation  on  every  clock  tick  and  whose  clock  period  is  bounded  by  c.  This 
conversion  can  result  in  a  performance  advantage  because,  in  practice,  there  are  time  penalties 
associated  with  the  loading  of  registers.  Because  of  this  overhead,  c  clock  ticks  of  a  circuit  with 
nominal  period  1  typically  use  more  time  than  one  clock  tick  of  a  circuit  with  nominal  period 
c.  Also,  a  reduction  in  registers  may  save  chip  area,  which  can  lead  to  further  performance 
improvements  since  the  wires  will  in  general  be  shorter.  A  possible  disadvantage  of  reducing  the 
number  of  equivalence  classes  of  registers  is  that  throughput  is  also  reduced  in  cases  where  the 
independent  streams  of  computation  might  be  effectively  utilized. 

8  Register  minimization  and  fanout 

Thus  far,  we  have  concentrated  on  clock  period  as  the  objective  function  for  determining  a 
retiming.  In  Section  7,  however,  we  showed  that  the  number  of  registers  in  a  circuit  could 
sometimes  be  reduced  by  a  method  other  than  retiming.  This  section  shows  that  the  problem  of 
retiming  a  circuit  to  minimize  the  total  state  of  a  circuit  is  polynomial-time  solvable  by  reducing 
that problem to a minimum-cost flow [10, p. 129] problem. We also show that the total state of a
circuit  can  be  minimized  subject  to  a  bound  on  the  clock  period.  These  results  can  be  extended 
to  reflect  the  widths  of  the  interconnections  and  ways  by  which  fanout  is  modeled. 

For a given circuit $G = (V, E, d, w)$, the state minimization problem is to determine a retiming
$r$ such that the total state $S(G_r) = \sum_{e \in E} w_r(e)$ of the retimed circuit is minimized. This problem
can be solved in polynomial time, as the following theorem shows.

Theorem  13.  The  state  minimization  problem  can  be  reduced  to  the  minimum-cost 
flow  problem. 

Sketch of proof. Let $G = (V, E, d, w)$ be a circuit. We seek a retiming $r$ such that the total state
$S(G_r) = \sum_{e \in E} w_r(e)$ of the retimed circuit is minimized. By the definition of $w_r$, we have

$$
\begin{aligned}
S(G_r) &= \sum_{e \in E} w_r(e) \\
       &= \sum_{u \xrightarrow{e} v} \bigl(w(e) + r(v) - r(u)\bigr) \\
       &= S(G) + \sum_{v \in V} r(v)\bigl(\mathrm{indegree}(v) - \mathrm{outdegree}(v)\bigr).
\end{aligned}
$$

Since $S(G)$ is constant, minimizing $S(G_r)$ is equivalent to minimizing the quantity

$$\sum_{v \in V} r(v)\bigl(\mathrm{indegree}(v) - \mathrm{outdegree}(v)\bigr), \tag{5}$$

which is a linear combination of the $r(v)$ since $(\mathrm{indegree}(v) - \mathrm{outdegree}(v))$ is constant for each
$v$. The minimization is subject to the constraint that for each edge $u \xrightarrow{e} v$, the register count
$w_r(e)$ is nonnegative, that is,

$$r(u) - r(v) \le w(e). \tag{6}$$

We can regard each edge $e \in E$ as a network flow arc having infinite capacity and having cost
$w(e)$ per unit of flow. The dual of the linear programming problem given by (5) and (6) asks
that we assign to each edge $e$ a nonnegative flow $f(e)$ such that the net flow out of any vertex $v$
is

$$\sum_{v \xrightarrow{e} ?} f(e) - \sum_{? \xrightarrow{e} v} f(e) = \mathrm{outdegree}(v) - \mathrm{indegree}(v), \tag{7}$$

and such that the total cost $\sum_{e \in E} w(e) f(e)$ is minimized. This problem can be expressed
directly as a minimum-cost flow problem by augmenting the flow graph with a source and a sink,
each of which is connected to each vertex by an edge whose capacity is determined by (7). The
lags $r(v)$ in the minimum-state retiming are the dual variables (potentials) for the optimal flow
$f^*(e)$, which most minimum-cost flow algorithms compute. ∎
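
As a concrete check of the derivation above, the following Python sketch evaluates the total state of a retimed circuit both directly and through expression (5), and finds a minimum-state retiming of a toy circuit by exhaustive search. The three-vertex circuit, the search range for the lags, and all function names are illustrative assumptions; a practical implementation would solve the minimum-cost flow dual rather than enumerate lags.

```python
from itertools import product

# Toy circuit (an assumption for illustration): each edge is (u, v, w(e)).
EDGES = [("a", "b", 2), ("b", "c", 0), ("c", "a", 1), ("a", "c", 1)]
VERTICES = ["a", "b", "c"]

def retimed_weights(edges, r):
    """Return w_r(e) = w(e) + r(v) - r(u) for every edge u -> v."""
    return [w + r[v] - r[u] for (u, v, w) in edges]

def state_direct(edges, r):
    """Total state S(G_r), summed edge by edge."""
    return sum(retimed_weights(edges, r))

def state_via_degrees(edges, vertices, r):
    """Total state as S(G) plus expression (5)."""
    indeg = {v: 0 for v in vertices}
    outdeg = {v: 0 for v in vertices}
    for (u, v, _) in edges:
        outdeg[u] += 1
        indeg[v] += 1
    s_g = sum(w for (_, _, w) in edges)
    return s_g + sum(r[v] * (indeg[v] - outdeg[v]) for v in vertices)

def min_state_bruteforce(edges, vertices, lo=-3, hi=3):
    """Try every lag assignment in [lo, hi]; keep the legal one with least state."""
    best = None
    for lags in product(range(lo, hi + 1), repeat=len(vertices)):
        r = dict(zip(vertices, lags))
        if min(retimed_weights(edges, r)) < 0:   # violates constraint (6)
            continue
        # Both ways of computing the state must agree.
        assert state_direct(edges, r) == state_via_degrees(edges, vertices, r)
        s = state_direct(edges, r)
        if best is None or s < best[0]:
            best = (s, r)
    return best

if __name__ == "__main__":
    print(min_state_bruteforce(EDGES, VERTICES))
```

On this toy circuit the original state is 4 and the search reports a minimum state of 3; one retiming that achieves it is r(a) = 0, r(b) = -1, r(c) = -1.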

The dominant cost in solving the state minimization problem is solving the minimum-cost
flow problem, for which many algorithms exist [1, 3, 5, 6, 18]. Using the algorithm due to Orlin
[18], for example, the state minimization problem can be solved in $O(|E|^2 \lg |V| + |V|\,|E| \lg^2 |V|)$
time. Under the assumption that the largest number of registers on any single edge in the circuit
is at most polynomial in $|V|$, we can use the algorithm due to Goldberg and Tarjan [6] to solve
the state minimization problem in $O(|V|\,|E| \lg |V| \lg(|V|^2/|E|))$ time.

More complicated problems can be solved within the same framework. For example, the total
state of a circuit can be minimized subject to a bound on the clock period. Given a maximum
allowable clock period $c$, we wish to find a retiming $r$ that minimizes the state $S(G_r)$ of the
retimed circuit subject to the condition that $\Phi(G_r) \le c$. In this case, we must minimize the
quantity (5) subject to the constraints from Theorem 7, which require that $r(u) - r(v) \le w(e)$
wherever $u \xrightarrow{e} v$, and that $r(u) - r(v) \le W(u, v) - 1$ wherever $D(u, v) > c$. The state minimization
problem remains the dual of a minimum-cost flow problem, but the flow graph is augmented with
additional edges (see footnote 8).

The state minimization problem can be generalized by allowing registers on different edges
to have different costs. For example, it may be cheaper to add a register along a one-bit wide
control path than along a 32-bit wide data path. We may model such situations by assigning to
each edge $e$ a breadth $\beta(e)$ proportional to the cost of adding a register along $e$. The objective
function which we must minimize is then given by

$$\sum_{v \in V} r(v) \Bigl( \sum_{? \xrightarrow{e} v} \beta(e) - \sum_{v \xrightarrow{e} ?} \beta(e) \Bigr), \tag{8}$$

and the constraints on the $r(v)$ are unchanged. This problem is still the dual of a minimum-cost
flow problem since the quantity in the large parentheses is a constant for each $v$. Although the
$\beta(e)$ need not be integers, if there is a solution to the state minimization problem, there is an
integer optimal solution because the linear programming tableau for the problem is unimodular
and the right-hand side is an integer vector.
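
For reference, the breadth-weighted objective follows from the same regrouping used for (5). The derivation below is a sketch in the notation of this section, writing $w_r(e) = w(e) + r(v) - r(u)$ for an edge $u \xrightarrow{e} v$ and using `?` as a wildcard vertex:

```latex
\begin{align*}
\sum_{e \in E} \beta(e)\, w_r(e)
  &= \sum_{u \xrightarrow{e} v} \beta(e)\bigl(w(e) + r(v) - r(u)\bigr) \\
  &= \sum_{e \in E} \beta(e)\, w(e)
     + \sum_{v \in V} r(v) \Bigl( \sum_{? \xrightarrow{e} v} \beta(e)
                                  - \sum_{v \xrightarrow{e} ?} \beta(e) \Bigr).
\end{align*}
```

The first term does not depend on $r$, so only the second term needs to be minimized. Setting $\beta(e) = 1$ for every edge turns the inner difference into indegree minus outdegree and recovers (5).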

In  a  physical  circuit,  a  signal  from  a  register  or  functional  element  may  fan  out  to  several 
functional  elements.  As  was  mentioned  in  footnote  2  in  Section  2,  we  model  this  situation  with 
several  different  edges  in  the  circuit  graph.  For  the  clock  period  minimization  problem,  there  was 
no  harm  in  modeling  fanout  in  this  manner,  but  for  the  state  minimization  problem,  there  can 
be.  The  difficulty  that  arises  in  the  state  minimization  problem  is  that  registers  can  be  shared 
along  the  physical  interconnection.  The  objective  functions  (5)  and  (8)  do  not  take  sharing  into 
account. 

8. See [20, Section VII.2] for a more extended discussion of the minimum-cost flow duals of the variants of the
state minimization problems discussed in this section.




Figure 6. Modeling two-way fanout with an extra vertex having delay zero.

Fanout can be incorporated into the model in several ways that allow the sharing of registers
to be accounted for exactly. We begin by looking at the situation in Figure 6(a), where one vertex
$u$ has an output that fans out to two vertices $v_1$ and $v_2$. To deal properly with this situation in the
state minimization problem, it is sufficient to introduce a dummy vertex $\hat{u}$ with zero propagation
delay which models the fork of the interconnection, as is shown in Figure 6(b). When the circuit
is retimed to minimize the number of registers, either the edge from $\hat{u}$ to $v_1$ or the edge from $\hat{u}$
to $v_2$ will have zero register count, and the edge from $u$ to $\hat{u}$ will have the shared registers. In
Figure 6(c) the edge to $v_1$ ends up with zero weight after retiming so as to minimize the total
number of registers.

Large  multiway  forks  present  some  modeling  alternatives  not  encountered  in  the  two-way 
case.  If  a  physical  interconnection  is  to  be  modeled,  the  fork  can  be  decomposed  into  several 
two-way  forks.  (In  fact,  our  concern  for  modeling  the  physical  interconnection  prompted  us  to 
design Correlator 1 with the $x_i$ running through the comparators rather than with multiway
fanout  directly  from  the  external  interface.) 

For logical design, however, it may be undesirable to model the physical interconnection. In
the case of a three-way fork, for instance, we might wish to share the largest possible number
of registers between the two edges with greatest register counts, regardless of which two edges
these end up being. Modeling a $k$-way fork for $k \ge 3$ by decomposing the interconnection into
two-way forks will not work.

A solution to this problem of modeling $k$-way fanout with maximum register sharing is
depicted in Figure 7. An output of vertex $u$, having breadth $\beta$, fans out to $v_1, \ldots, v_k$ along edges
$u \xrightarrow{e_1} v_1, u \xrightarrow{e_2} v_2, \ldots, u \xrightarrow{e_k} v_k$. In the retimed circuit, the cost of this fanout should be $\beta$ times
the maximum of the retimed edge weights $w_r(e_i)$. So that the register count cost function $S(G_r)$
will properly model the register sharing, we first add a dummy vertex $\hat{u}$ with zero propagation
delay. Letting $w_{\max} = \max_{1 \le i \le k} w(e_i)$, we add edges $v_i \xrightarrow{\hat{e}_i} \hat{u}$ with weights $w(\hat{e}_i) = w_{\max} - w(e_i)$.
Finally, we give all edges $e_i$ and $\hat{e}_i$ breadths of $\beta/k$.

The modified circuit graph accurately models the sharing of registers among the edges $e_i$
involved in the fanout when the state is minimized. For any retiming $r$, Lemma 1 dictates that
the weights $w_r(p_i)$ of all paths $p_i = u \xrightarrow{e_i} v_i \xrightarrow{\hat{e}_i} \hat{u}$ will be identical since they are identical in the
unretimed circuit. The retimed register counts $w_r(e_i)$ are constrained by the rest of the circuit,
but the weights $w_r(\hat{e}_i)$ will be as small as possible because $\hat{u}$ is a sink in the graph. Thus the
register count of one of the $\hat{e}_i$ will be zero, and therefore the weight of each path $p_i$ will be
$\max_{1 \le i \le k} w_r(e_i)$. Since there are $k$ paths, each with breadth $\beta/k$, the total cost of the paths will
be $\beta \cdot \max_{1 \le i \le k} w_r(e_i)$ as desired.
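
The gadget construction can be sketched in Python as follows. The function names and the example weights are illustrative assumptions. The sketch builds the extra edges into the dummy vertex, applies given lags to the fanout, chooses the lag of the dummy vertex as small as the nonnegativity constraints allow (which is what state minimization does, since the dummy vertex is a sink), and returns the breadth-weighted cost of the gadget.

```python
from fractions import Fraction

def build_gadget(fanout_weights, beta):
    """fanout_weights[i] is w(e_i) for the edge u -> v_i.
    Returns the weights of the added edges v_i -> u_hat and the breadth beta/k."""
    w_max = max(fanout_weights)
    hat_weights = [w_max - w for w in fanout_weights]   # w(e_hat_i) = w_max - w(e_i)
    breadth = Fraction(beta, len(fanout_weights))       # every gadget edge gets beta/k
    return hat_weights, breadth

def gadget_cost(fanout_weights, beta, r_u, r_v):
    """Cost of the gadget after retiming with lag r_u for u and r_v[i] for v_i."""
    hat_weights, breadth = build_gadget(fanout_weights, beta)
    # Retimed fanout weights: w_r(e_i) = w(e_i) + r(v_i) - r(u).
    wr = [w + rv - r_u for w, rv in zip(fanout_weights, r_v)]
    assert min(wr) >= 0, "lags must keep every register count nonnegative"
    # w_r(e_hat_i) = w(e_hat_i) + r(u_hat) - r(v_i) >= 0, so the smallest legal
    # lag for u_hat is the largest of r(v_i) - w(e_hat_i).
    r_hat = max(rv - wh for rv, wh in zip(r_v, hat_weights))
    wr_hat = [wh + r_hat - rv for wh, rv in zip(hat_weights, r_v)]
    cost = breadth * (sum(wr) + sum(wr_hat))
    return cost, wr, wr_hat

if __name__ == "__main__":
    cost, wr, wr_hat = gadget_cost([1, 3, 2], beta=6, r_u=0, r_v=[1, -2, 0])
    print(cost, wr, wr_hat)
```

With fanout weights 1, 3, 2, breadth 6, and lags r(u) = 0, r(v_1) = 1, r(v_2) = -2, r(v_3) = 0, the retimed fanout weights are 2, 1, 2, at least one added edge ends up with zero registers, and the gadget cost is 12, which is 6 times the largest retimed fanout weight.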


Figure  7.  A  gadget  for  modeling  the  cost  of  multiway  fanout  with  maximal  sharing  of  registers. 


9  A  more  general  model  for  propagation  delay 

In  this  section  we  extend  the  methods  of  Sections  5  and  6  to  deal  with  functional  elements 
in  which  the  propagation  delays  through  individual  functional  elements  are  nonuniform.  In  an 
adder,  for  example,  the  propagation  delay  from  a  low-order  input  bit  to  a  high-order  output  bit 
may  be  far  greater  than  the  propagation  delay  from  a  low-order  input  bit  to  a  low-order  output 
bit  or  from  a  high-order  input  bit  to  a  high-order  output  bit.  Thus,  the  worst-case  propagation 
delay  through  two  cascaded  adders  can  be  much  less  than  twice  the  worst-case  propagation  delay 
through  a  single  adder.  This  section  gives  a  more  general  circuit  model  to  handle  this  commonly 
occurring  situation.  We  show  how  the  retiming  problem  in  this  model  can  be  reduced  to  simple 
mixed-integer  programming  as  in  Section  6.  We  also  give  a  more  efficient  relaxation  algorithm 
similar  to  that  in  Section  5. 


Figure  8.  A  functional  element  with  nonuniform  propagation  delays.  The  time  at  which  output  X 
must  settle  is  either  4  esec  after  input  A  settles  or  11  esec  after  input  B  settles,  whichever  is  later. 

We may take into account nonuniform propagation delays through functional elements by
modifying the model for synchronous circuits given in Section 2, so that from each input to
each output of a given functional element, an independent propagation delay may be assigned.
Figure 8 shows graphically the “insides” of a functional element in this model. The vertex $v$
contains internal edges drawn from a set $F_v \subseteq \{e : ? \xrightarrow{e} v\} \times \{e : v \xrightarrow{e} ?\}$. We denote the set of all
internal edges of all vertices as $F = \bigcup_{v \in V} F_v$.

We augment our arrow notation as follows. For internal edges $f_a \in F_u$ and $f_b \in F_v$, we use
the notation $f_a \xrightarrow{e} f_b$ to indicate that normal edge $e$ connects internal edge $f_a$ of $u$ to internal
edge $f_b$ of $v$. This means not only that $u \xrightarrow{e} v$, but also that there are edges $? \xrightarrow{e_I} u$ and $v \xrightarrow{e_O} ?$ such
that $f_a = (e_I, e)$ and $f_b = (e, e_O)$. Notice that a single edge $u \xrightarrow{e} v$ typically connects multiple
internal edges of $u$ with multiple internal edges of $v$.

The propagation delay function $d$, rather than being a function from $V$ to the nonnegative
reals, is a function from $F$ to the nonnegative reals. For an internal edge $f = (e_I, e_O) \in F_v$, the
value $d(f)$ denotes the propagation delay through $f$. A given output of a functional element $v$
need not depend on all the inputs. In an adder, for example, the values of the high-order input
bits have no effect on the low-order bits of the output.

We view paths in the extended model as going from an internal edge, via an alternating
sequence of normal and internal edges, to an internal edge. For any path $p = f_0 \xrightarrow{e_0} f_1 \xrightarrow{e_1} \cdots \xrightarrow{e_{k-1}} f_k$,
the delay is naturally defined as $d(p) = d(f_0) + d(f_1) + \cdots + d(f_k)$, and the register count
of the path is defined as $w(p) = w(e_0) + w(e_1) + \cdots + w(e_{k-1})$. We use the notation $u \stackrel{p}{\leadsto} v$ to
denote that path $p$ goes from some internal edge of vertex $u$ to some internal edge of vertex $v$.
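
The path definitions can be illustrated with a small Python sketch. The encoding is an assumption made for illustration: an internal edge is a pair (input edge, output edge), the dictionaries `DELAY` and `REGS` hold d(f) and w(e), and the numbers are made up.

```python
# Hypothetical data: internal edges are (input edge, output edge) pairs.
DELAY = {("e0", "e1"): 4, ("e1", "e2"): 11, ("e2", "e3"): 3}   # d(f)
REGS = {"e1": 0, "e2": 1}                                      # w(e)

def path_delay_and_weight(internal_edges, delay, regs):
    """internal_edges lists f_0, ..., f_k in path order.
    Consecutive internal edges must share the normal edge that joins them."""
    normal_edges = []
    for f_a, f_b in zip(internal_edges, internal_edges[1:]):
        if f_a[1] != f_b[0]:
            raise ValueError("consecutive internal edges are not joined by a normal edge")
        normal_edges.append(f_a[1])
    d_p = sum(delay[f] for f in internal_edges)      # d(p) sums all k+1 internal edges
    w_p = sum(regs[e] for e in normal_edges)         # w(p) sums the k normal edges between them
    return d_p, w_p

if __name__ == "__main__":
    path = [("e0", "e1"), ("e1", "e2"), ("e2", "e3")]
    print(path_delay_and_weight(path, DELAY, REGS))
```

For the three internal edges above the path has delay 18 and register count 1. The first input edge and the last output edge contribute nothing to the register count, because a path starts and ends on internal edges.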

The clock period $\Phi(G)$ of a circuit $G = (V, E, d, w)$ is the maximum delay along any path of
zero weight. We define retiming, which affects $w$ but not $d$, exactly as it is defined in the model
of Section 2 (see footnote 9).

The  following  results  show  how  the  clock  period  minimization  problem  for  a  circuit  G  under 
the  extended  model  can  be  reduced  to  an  instance  of  Problem  MILP  having  one  integer  variable 
for  each  functional  element  of  G  and  one  real  variable  for  each  edge  of  G.  The  results,  which 
parallel  Lemma  8,  Lemma  9,  and  Theorem  10,  are  presented  without  proof. 

Lemma 14. Let $G = (V, E, d, w)$ be a synchronous circuit in the extended model,
and let $c$ be a positive real number. Then the clock period $\Phi(G)$ is less than or equal
to $c$ if and only if there exists a function $s : E \to [0, c]$ such that

14.1 $s(e) \ge d(f)$ wherever $f \xrightarrow{e} ?$, and

14.2 $s(e_b) \ge s(e_a) + d(f)$ wherever $? \xrightarrow{e_a} f \xrightarrow{e_b} ?$ and $w(e_a) = 0$.

Lemma 15. Let $G = (V, E, d, w)$ be a synchronous circuit in the extended model, and
let $c$ be a positive real number. Then there is a retiming $r$ of $G$ such that $\Phi(G_r) \le c$ if
and only if there exists an assignment of a real value $s(e)$ to each edge $e \in E$ and an
integer value $r(v)$ to each vertex $v \in V$ such that the following conditions are satisfied:

15.1 $-s(e) \le -d(f)$ wherever $f \xrightarrow{e} ?$,

15.2 $s(e) \le c$ for every edge $e \in E$,

15.3 $r(u) - r(v) \le w(e)$ wherever $u \xrightarrow{e} v$, and

15.4 $s(e_a) - s(e_b) \le -d(f)$ wherever $u \xrightarrow{e_a} f \xrightarrow{e_b} ?$, $r(u) - r(v) = w(e_a)$, and $f \in F_v$.

Theorem 16. Let $G = (V, E, d, w)$ be a synchronous circuit in the extended model,
and let $c$ be a positive real number. Then there is a retiming $r$ of $G$ such that $\Phi(G_r) \le c$
if and only if there exists an assignment of a real value $R(e)$ to each edge $e \in E$ and
an integer value $r(v)$ to each vertex $v \in V$ such that the following conditions are
satisfied:

16.1 $r(v) - R(e) \le -d(f)/c$ wherever $f \xrightarrow{e} ?$ and $f \in F_v$,

16.2 $R(e) - r(v) \le 1$ wherever $v \xrightarrow{e} ?$,

16.3 $r(u) - r(v) \le w(e)$ wherever $u \xrightarrow{e} v$, and

16.4 $R(e_a) - R(e_b) \le w(e_a) - d(f)/c$ wherever $? \xrightarrow{e_a} f \xrightarrow{e_b} ?$.

Theorem 16 says that the problem of testing whether a given clock period is feasible for
a circuit $G = (V, E, d, w)$ in the extended model can be efficiently reduced to an instance of
Problem MILP having $k = |V|$ integer variables, $n - k = |E|$ real variables, and $m = 2|E| + 2|d|$
inequalities, where $|d|$ is the number of pairs $(e_a, e_b)$ of edges for which $d(e_a, e_b)$ is defined. Thus
the feasible clock period test can be performed in $O(|d|(|E| + |V| \lg |E|))$ time.
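
To make the size of the MILP instance concrete, the following Python sketch enumerates the difference constraints of Theorem 16 for a small circuit. The data layout, the function name, and the three-vertex ring are assumptions made for illustration, and condition 16.2 is read here as applying to each edge $e$ that leaves $v$. Each constraint is returned as a triple (x, y, bound) meaning x - y <= bound.

```python
def theorem16_constraints(edges, internal, c):
    """edges: name -> (u, v, w(e)) for an edge u -> v.
    internal: (e_a, e_b) -> d(f), where e_a enters and e_b leaves the same vertex.
    Variables are ("r", vertex) for integer lags and ("R", edge) for real values."""
    cons = []
    for (e_a, e_b), delay in internal.items():
        v = edges[e_b][0]                 # f lies inside the vertex that e_b leaves
        assert edges[e_a][1] == v, "e_a must enter the vertex that e_b leaves"
        cons.append((("r", v), ("R", e_b), -delay / c))                   # 16.1
        cons.append((("R", e_a), ("R", e_b), edges[e_a][2] - delay / c))  # 16.4
    for e, (u, v, w) in edges.items():
        cons.append((("R", e), ("r", u), 1))                              # 16.2
        cons.append((("r", u), ("r", v), w))                              # 16.3
    return cons

# Hypothetical ring a -> b -> c -> a with one internal edge per vertex.
RING_EDGES = {"e1": ("a", "b", 1), "e2": ("b", "c", 0), "e3": ("c", "a", 1)}
RING_INTERNAL = {("e3", "e1"): 2, ("e1", "e2"): 3, ("e2", "e3"): 1}

if __name__ == "__main__":
    for constraint in theorem16_constraints(RING_EDGES, RING_INTERNAL, c=4):
        print(constraint)
```

The ring has 3 edges and 3 internal edges, so the sketch produces 2·3 + 2·3 = 12 inequalities, in agreement with the count m = 2|E| + 2|d|.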

We can reduce the cost of the feasibility test to $O(|V|\,|F|)$ by using an algorithm similar to
Algorithm FEAS. The dominant cost of this computation is due to $|V|$ executions of an algorithm
similar to Algorithm CP, each of which runs in $O(|F|)$ time.

9. The possibility that the internal connections between the inputs and outputs of a functional element may not
be a complete bipartite graph gives rise to some technical differences between the extended model and the model
of Section 2. First, Condition W2 need only be imposed for those cycles in which consecutive edges are actually
connected by the internal data paths in the vertices. Second, retiming may not be the only way to adjust register
counts so that function is guaranteed to be preserved: if the (undirected) graph of internal connections in some
functional element is not connected, then the element can be broken up into two or more independent components
which can be given different lags.

The clock period minimization algorithm for circuits in the extended model is similar to
Algorithm OPT2, which performs a binary search over a set of possible values for the minimal
clock period. By the same argument used in Corollary 6 for the model of Section 2, the optimal
clock period must be equal to $D(u, v)$ for some pair of vertices $u$ and $v$, where $D$ and $W$ are defined
as in Equation (4), but with the semantics of the notation interpreted in the extended model. The
values $D(u, v)$ for all connected pairs of vertices $u$ and $v$ can be found in $O(|E|\,|F| + |V|\,|E| \lg |E|)$
time by an algorithm similar to Algorithm WD. The key step is to apply Johnson's all-pairs
shortest-paths algorithm [7] to the edge-weighted graph $H = (E, F, wd)$, where the weighting
function is defined by $wd(f) = (w(e), -d(f))$ wherever $? \xrightarrow{e} f$ in $G$ (that is, wherever $e \xrightarrow{f} ?$
in $H$). Using Fibonacci heaps [2] for the priority queue in Johnson's algorithm, a time bound
of $O(|E|\,|F| + |E|^2 \lg |E|)$ can be achieved. One additional observation is required to prove the
claimed time bound of $O(|E|\,|F| + |V|\,|E| \lg |E|)$. The dominant cost of computing $W$ and $D$ by
Johnson's all-pairs shortest-paths algorithm is due to $|E|$ applications of Dijkstra's algorithm to
find shortest paths from each vertex $e \in H$ to each other vertex. Since $W$ and $D$ are defined
on $V \times V$ rather than on $E \times E$, however, we really only need to solve $|V|$ problems of the
form: Given a vertex $v \in V$, find for each $e \in E$ the weight of a shortest path from $v$ to $e$.
Each such problem can be solved by using Dijkstra's algorithm to find the shortest path weights
from a set of vertices in $H$ (namely, from any $x \in E$ such that $? \xrightarrow{x} v$ in $G$), rather than
from a single vertex. Using Fibonacci heaps, the claimed running time for computing $W$ and
$D$ is obtained. The total cost of clock period minimization in the extended model is therefore
$O(|E|\,|F| + |V|\,|E| \lg |E|) + O(|V|\,|F| \lg |V|) = O(|E|\,|F| + |V|\,|F| \lg |V|)$.

10  Concluding  remarks 

Our  goal  has  been  to  provide  a  general  framework  for  the  precise  understanding  of  circuit  timing. 
Through  the  use  of  a  simple  graph-theoretic  model,  we  have  been  able  to  cast  a  variety  of  circuit 
timing  problems  in  purely  combinatorial  terms.  We  believe  our  approach  to  be  robust.  Many 
other  circuit  models  and  many  other  circuit  problems  can  be  handled  within  the  basic  framework. 
We  take  time  here  to  discuss  a  few. 

Pipelining. An important special case of clock period minimization is the problem of
optimally pipelining combinational circuitry. In a combinational circuit, all register counts are zero,
and thus the circuit graph is acyclic. We can consider the circuit to have one input interface $v_I$
and one output interface $v_O$. By retiming a combinational circuit $G$, we can produce a pipelined
circuit $G_r$ which achieves a shorter clock period at the cost of introducing a latency of $r(v_O) - r(v_I)$
clock ticks for signals to propagate from the input interface $v_I$ to the output interface $v_O$.
In the optimal pipelining problem, we are given a combinational circuit $G$ and a nonnegative
integer $l$ and asked to produce a retimed circuit $G_r$ with minimum clock period subject to
the constraint that the retimed circuit have latency at most $l$. This problem is just the clock
period minimization problem with the additional constraint that $r(v_O) - r(v_I) \le l$. This additional
constraint can be modeled by augmenting the circuit $G$ with an edge $v_O \xrightarrow{e} v_I$ having weight
$w(e) = l$. Also, the methods of Section 8 can be applied to solve the problem of minimizing the
state of pipelined circuitry subject to upper bounds on clock period and latency.
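
The trade-off between latency and clock period can be illustrated by exhaustive search on a tiny combinational chain. The sketch below is not one of the algorithms of this paper: the circuit, the names, and the restriction of lags to the range 0..l with the input interface fixed at lag 0 are all assumptions made for illustration, and the latency bound is enforced directly as a constraint on the lags rather than through an added edge.

```python
from itertools import product

def clock_period(vertices, edges, d):
    """Largest total delay along any path whose edges all carry zero registers.
    Assumes the zero-register subgraph is acyclic."""
    zero = [(u, v) for (u, v, w) in edges if w == 0]
    indeg = {v: 0 for v in vertices}
    for _, v in zero:
        indeg[v] += 1
    arrival = {v: d[v] for v in vertices}
    ready = [v for v in vertices if indeg[v] == 0]
    while ready:                       # process vertices in topological order
        u = ready.pop()
        for a, b in zero:
            if a == u:
                arrival[b] = max(arrival[b], arrival[u] + d[b])
                indeg[b] -= 1
                if indeg[b] == 0:
                    ready.append(b)
    return max(arrival.values())

def best_pipelining(vertices, edges, d, v_in, v_out, latency):
    """Return (period, lags) minimizing the period with r(v_out) - r(v_in) <= latency."""
    best = None
    for lags in product(range(latency + 1), repeat=len(vertices)):
        r = dict(zip(vertices, lags))
        if r[v_in] != 0 or r[v_out] - r[v_in] > latency:
            continue
        retimed = [(u, v, w + r[v] - r[u]) for (u, v, w) in edges]
        if any(w < 0 for (_, _, w) in retimed):
            continue                   # illegal retiming
        period = clock_period(vertices, retimed, d)
        if best is None or period < best[0]:
            best = (period, r)
    return best

# Hypothetical chain vI -> a -> b -> c -> vO; each logic element has delay 3.
PIPE_VERTICES = ["vI", "a", "b", "c", "vO"]
PIPE_DELAYS = {"vI": 0, "a": 3, "b": 3, "c": 3, "vO": 0}
PIPE_EDGES = [("vI", "a", 0), ("a", "b", 0), ("b", "c", 0), ("c", "vO", 0)]

if __name__ == "__main__":
    for l in range(3):
        print(l, best_pipelining(PIPE_VERTICES, PIPE_EDGES, PIPE_DELAYS, "vI", "vO", l)[0])
```

For this chain the best clock period is 9 with latency 0, 6 with latency at most 1, and 3 with latency at most 2.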

Timing at external interfaces. An external interface may be forced to meet various timing
specifications. For instance, if an external interface has a known time delay between the time at
which it receives outputs from the circuit and the time at which it presents inputs, the external
vertex can be assigned a propagation delay greater than 0. By augmenting the set of inequalities
specified in Theorem 10, it is often possible for the optimization algorithms to act subject to
other constraints. If data must be available to an interface along some edge $v \xrightarrow{e} v_0$ within some
time $t$ after each clock tick, for example, we can express this by the inequality

$$R(v) - r(v_0) \le w(e) + \frac{t}{c},$$

where $v_0$ is the vertex representing the interface. This constraint is equivalent to saying that we
must have $\Delta(v) \le t$ if the register count $w_r(e)$ of edge $e$ is zero in the retimed circuit. Similarly,
if data from the interface is not available on an edge $v_0 \xrightarrow{e} v$ until some time $t$ after each clock
tick, this constraint can be expressed by the inequality

$$r(v_0) - R(v) \le w(e) - \frac{t}{c} - \frac{d(v)}{c}.$$
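
The equivalence claimed for the first inequality can be checked in a few lines. The sketch below rests on two assumptions that should be verified against the statement of Theorem 10: that the real variable satisfies $R(v) = r(v) + s(v)/c$, where $s(v)$ is the settling time written $\Delta(v)$ in the text, and that the first inequality reads $R(v) - r(v_0) \le w(e) + t/c$ for the edge $v \xrightarrow{e} v_0$.

```latex
\begin{align*}
w_r(e) = 0 \;&\Longleftrightarrow\; r(v) - r(v_0) = w(e), \\
R(v) - r(v_0) \le w(e) + \frac{t}{c}
  \;&\Longleftrightarrow\; r(v) - r(v_0) + \frac{s(v)}{c} \le w(e) + \frac{t}{c} \\
  \;&\Longleftrightarrow\; s(v) \le t \qquad \text{when } w_r(e) = 0 .
\end{align*}
```

When $w_r(e) \ge 1$ we have $r(v) - r(v_0) \le w(e) - 1$, and since $s(v) \le c$ the inequality then holds automatically for any $t \ge 0$, so the constraint is active only when no register separates $v$ from the interface.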

Geometric  considerations.  The  optimization  methods  discussed  in  this  paper  can  be 
applied  largely  independently  of  geometric  considerations  because  although  retiming  causes  the 
addition  and  deletion  of  registers,  it  otherwise  leaves  the  functional  elements  and  their  pattern 
of  interconnection  the  same.  Thus  if  a  given  circuit  has  an  area-efficient  layout,  chances  are  that 
a  retimed  form  of  the  circuit  can  be  laid  out  efficiently.  In  some  cases,  however,  the  floorplan  of 
a  circuit  may  limit  the  number  of  registers  on  certain  interconnections. 

The inequalities that constrain the retimed system can be augmented to express these
geometric constraints. For example, to specify an upper bound $k$ on the number of registers that
can fit along some edge $u \xrightarrow{e} v$, we can impose the constraint

$$r(v) - r(u) \le k - w(e).$$

We can also model a situation in which the first $k$ registers on an edge $u \xrightarrow{e} v$ are relatively cheap
and additional registers are more expensive. Add an auxiliary vertex $\hat{u}$ in the middle of edge $e$.
Then assign a high cost to registers on the edge $u \to \hat{u}$ and a low cost to registers on the edge
$\hat{u} \to v$, but constrain $\hat{u} \to v$ to have at most $k$ registers in the retimed system. Solve the system
of constraints as in Section 8. On the other hand, if the first register on a connection is expensive
and additional registers are cheap, then it is NP-complete to determine whether a circuit can be
retimed to achieve a specified bound on register cost [20, pp. 182-183].

Slowdown.  In  Section  7  we  showed  how  a  c-slow  circuit,  which  supports  c  independent 
streams  of  computation,  can  be  reduced  to  support  a  single  stream  of  computation  by  removing 
registers.  The  notion  of  c-slow  circuitry  offers  new  insight  into  many  circuit  designs  that  are 
not  technically  c-slow.  Consider,  for  example,  a  2-slow  circuit  in  which  only  one  stream  of 
computation  is  being  used.  The  registers  in  the  circuit  fall  into  two  equivalence  classes,  one  of 
which  is  idle  during  each  clock  period.  Using  Algorithm  R  to  remove  all  the  registers  in  one 
equivalence  class  is  one  way  to  optimize  such  a  circuit. 

Another  way  to  save  registers  is  to  modify  the  functional  elements  to  perform  slightly  different 
actions  on  even  and  odd  time  steps  so  that  each  physical  register  plays  the  roles  of  two  logical 
registers,  one  in  each  equivalence  class.  A  cursory  examination  of  the  resulting  circuit  would  not 
reveal  that  it  is  2-slow  according  to  the  circuit  model,  but  it  would  nevertheless  communicate 
with  the  host  only  on  every  other  clock  tick.  Although  this  method  for  saving  registers  may 
sometimes  be  acceptable,  the  overhead  of  register  multiplexing  and  the  complexity  of  control 
suggest  that  Algorithm  R  is  a  more  reasonable  alternative.  Moreover,  when  confronted  with  a 
circuit  that  communicates  externally  only  on  every  other  clock  tick  or  a  circuit  whose  functional 
elements  perform  different  operations  on  alternate  clock  ticks,  we  may  suspect  that  it  is  really  a 
2-slow  circuit  in  disguise,  and  that  penetrating  the  disguise  might  lead  to  improved  performance 
and  simplification  of  the  control  logic. 

Data-dependent propagation delays. A major deficiency of the circuit model is its
inability to represent combinational logic elements with data-dependent propagation delays. For
example, if a multiplier can produce an answer quickly whenever one of its inputs is zero, its
propagation delay is data dependent. One would like to take advantage of the shorter delay
whenever possible in order to speed a larger computation.

While  we  are  unable  to  model  data  dependence  in  the  general  case,  we  can  sometimes  use 
the  extended  circuit  model  of  Section  9  to  partially  model  the  effects.  As  an  example,  in  nMOS 
circuits  [16],  the  transition  of  a  Boolean  signal  from  0  to  1  can  take  much  longer  than  the  transition 
of  the  same  signal  from  1  to  0.  We  can  model  this  situation  somewhat  by  representing  each  wire 
as  two  edges  in  the  graph,  one  representing  the  value  0  on  the  wire  and  the  other  representing  1. 
We  choose  propagation  delays  for  internal  edges  of  a  functional  element  depending  on  how  0 
or  1  inputs  affect  the  output.  Unfortunately,  we  cannot  model  how  the  delays  affect  the  clock 
period  exactly,  but  upper  bounds  can  be  obtained  which  will,  for  example,  properly  model  the 
propagation  delay  through  two  cascaded  inverters. 

There is much more to be understood about clocked circuits. Do powerful combinatorial
optimization techniques apply to other timing models such as those involving multiphase clocking
disciplines? Can data-dependent propagation delays be handled in a reasonably general setting?
Is it possible to solve the state minimization problem with a polynomial-time algorithm that
is simpler than the typical algorithms for solving minimum-cost flow problems? Can
hierarchically described circuits be optimally retimed in time proportional to their descriptions? Under
what circumstances can optimal retimings of parametrized families of circuits be algorithmically
obtained?

Retiming  is  a  transformation  that  can  be  used  to  produce  efficient  circuits,  and  we  have 
presented  a  variety  of  algorithms  for  automatically  retiming  circuits.  Of  great  interest,  however, 
are  design  methodologies  in  which  retiming  is  performed  by  an  individual  instead  of  an  algorithm, 
as  is  done  in  [11],  [12],  [14],  and  [20].  Retiming  seems  to  be  a  valuable  technique  which  could  be 
incorporated into both circuit compilers and interactive design tools.